The KPMG Bicycle Problem¶
- Introduction
- Task 1 : Data Quality Assessment
- Data Quality Framework
- Importing the data
- Analyzing the dataframes
- Task 2 : Data Insights
- Task 2 : RFM Analysis
- Task 2 : Geospatial Analysis
- Task 3 : Model Development
- End Notes
- References
Introduction¶
A classic example of customer segmentation analysis.
This is part of the KPMG Job Simulation Program; it used to be available on Forage but has since been archived.
This notebook is a summary and updated version of the three notebooks I've created before, which can be visited here:
Background :¶
Apparently I’m finally working at KPMG (please hire me) and I just got my first client: a medium-sized bikes and cycling accessories company, Sprocket Central Pty Ltd.
My Analytics, Information and Modelling team and I are supposed to help them skyrocket their business 📈👌
They provided me with three datasets:
- Customer Demographic
- Customer Addresses
- Transactions data in the past 3 months
But of course there are issues with their data; it cannot be used to build some magical model right away. My Associate Director also suggested that I optimise the quality of the customer data before deriving any kind of insights for the company's growth.
Which brings us to our first task!
Task 1 : Data Quality Assessment¶
Fix the issues we encounter in all three datasets, and draft an email to the client identifying the data quality issues and ways to mitigate them.
At the end of this task, the data will be clean, accurate, consistent, and ready to use for our analysis.
Data Quality Framework¶
| Framework | Description |
|---|---|
| Accuracy | The closeness of a value to its correct representation of the real-life phenomenon |
| Completeness | The extent to which data are of sufficient breadth, depth, and scope for the task at hand |
| Consistency | The extent to which data are uniform in format, use, and meaning across a data collection |
| Currency | The freshness of data |
| Volatility | The length of time the data remains valid |
| Relevancy | The extent to which data are appropriate for the task at hand |
| Validity | The extent to which data conform to defined business rules or constraints |
| Uniqueness | The extent to which data are unique within the dataset |
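Several of these dimensions translate directly into simple pandas checks. A minimal sketch (illustrative only; the helper name and toy data are made up here, not part of the original notebook):

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Score a dataframe on two of the framework's dimensions."""
    return {
        # Completeness: share of cells that are not missing
        "completeness": 1 - df.isnull().to_numpy().mean(),
        # Uniqueness: share of rows that are not exact duplicates
        "uniqueness": 1 - df.duplicated().mean(),
    }

toy = pd.DataFrame({"id": [1, 2, 3], "state": ["NSW", None, "VIC"]})
print(quality_report(toy))  # completeness ≈ 0.83, uniqueness = 1.0
```

Dimensions like validity would need business rules (allowed value sets, ranges) supplied per column, which is exactly what we derive in the assessment below.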
Importing the data¶
We will use the pandas library to import the data and perform our analysis. You could also use Excel, Google Sheets, or whatever tools you're comfortable with.
# Importing the libraries
import numpy as np
import pandas as pd
# Importing the dataset
xls = pd.ExcelFile(
"/home/meow/Desktop/internship-speedrun/kpmg/KPMG_VI_New_raw_data_update_final.xlsx"
)
# I'm using Black formatting for the code
# Keep note of the naming convention:
# we used PascalCase to name all our original datasets
Transactions = pd.read_excel(xls, "Transactions", skiprows=1)
CustomerDemographic = pd.read_excel(xls, "CustomerDemographic", skiprows=1)
CustomerAddress = pd.read_excel(xls, "CustomerAddress", skiprows=1)
NewCustomerList = pd.read_excel(xls, "NewCustomerList", skiprows=1)
Analyzing the dataframes¶
We will analyze the dataframes and see what issues we can find in them.
- Shape and Cardinality
- Missing Values
- Duplicates
- Data Types
Solutions and mitigation strategies are available at the end of each section.
Shape and Cardinality¶
- Shape of the data is 20000 rows and 13 columns
- Columns with cardinality 2 should be converted to boolean
- Converting them would be useful for non-machine-learning approaches
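Candidate boolean columns can also be found programmatically; a sketch (the helper is hypothetical, not from the original notebook):

```python
import pandas as pd

def binary_candidates(df: pd.DataFrame) -> list:
    """Return columns with exactly two distinct non-null values."""
    return [col for col in df.columns if df[col].nunique(dropna=True) == 2]

toy = pd.DataFrame({
    "order_status": ["Approved", "Cancelled", "Approved"],
    "list_price": [71.49, 2091.47, 1793.43],
})
print(binary_candidates(toy))  # → ['order_status']
```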
Transactions.head()
| transaction_id | product_id | customer_id | transaction_date | online_order | order_status | brand | product_line | product_class | product_size | list_price | standard_cost | product_first_sold_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 2950 | 2017-02-25 | 0.0 | Approved | Solex | Standard | medium | medium | 71.49 | 53.62 | 41245.0 |
| 1 | 2 | 3 | 3120 | 2017-05-21 | 1.0 | Approved | Trek Bicycles | Standard | medium | large | 2091.47 | 388.92 | 41701.0 |
| 2 | 3 | 37 | 402 | 2017-10-16 | 0.0 | Approved | OHM Cycles | Standard | low | medium | 1793.43 | 248.82 | 36361.0 |
| 3 | 4 | 88 | 3135 | 2017-08-31 | 0.0 | Approved | Norco Bicycles | Standard | medium | medium | 1198.46 | 381.10 | 36145.0 |
| 4 | 5 | 78 | 787 | 2017-10-01 | 1.0 | Approved | Giant Bicycles | Standard | medium | large | 1765.30 | 709.48 | 42226.0 |
Transactions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   transaction_id           20000 non-null  int64
 1   product_id               20000 non-null  int64
 2   customer_id              20000 non-null  int64
 3   transaction_date         20000 non-null  datetime64[ns]
 4   online_order             19640 non-null  float64
 5   order_status             20000 non-null  object
 6   brand                    19803 non-null  object
 7   product_line             19803 non-null  object
 8   product_class            19803 non-null  object
 9   product_size             19803 non-null  object
 10  list_price               20000 non-null  float64
 11  standard_cost            19803 non-null  float64
 12  product_first_sold_date  19803 non-null  float64
dtypes: datetime64[ns](1), float64(4), int64(3), object(5)
memory usage: 2.0+ MB
# List the cardinality of all the columns
# Cardinality is the number of unique values in a column
# This is a good way to spot categorical columns
for col in Transactions.columns:
    print(f"{col}: {Transactions[col].nunique()}")
print("------------------------------------")
# Print the unique values of every column with cardinality below 10
for col in Transactions.columns:
    if Transactions[col].nunique() < 10:
        print(f"{col}: {Transactions[col].unique()}")
transaction_id: 20000
product_id: 101
customer_id: 3494
transaction_date: 364
online_order: 2
order_status: 2
brand: 6
product_line: 4
product_class: 3
product_size: 3
list_price: 296
standard_cost: 103
product_first_sold_date: 100
------------------------------------
online_order: [ 0.  1. nan]
order_status: ['Approved' 'Cancelled']
brand: ['Solex' 'Trek Bicycles' 'OHM Cycles' 'Norco Bicycles' 'Giant Bicycles' 'WeareA2B' nan]
product_line: ['Standard' 'Road' 'Mountain' 'Touring' nan]
product_class: ['medium' 'low' 'high' nan]
product_size: ['medium' 'large' 'small' nan]
Missing Values and Duplicate Rows¶
- The missing values in this dataframe are insignificant in number and largely come from the same rows, so we could safely drop them.
- There are no duplicate rows
# Function to analyse the missing data
def analyze_null_values(dataframe):
    total_null_values = dataframe.isnull().sum()
    total_values = dataframe.count().sort_values(ascending=True)
    # Note: the percentage uses the non-null count as the denominator,
    # i.e. the ratio of missing to present values for each column
    null_values_percentage = (total_null_values / total_values) * 100
    missing_data = pd.concat(
        {
            "Null Values": total_null_values,
            "Percentage of Missing Values": null_values_percentage,
            "Total Values": total_values,
        },
        axis=1,
    )
    missing_data = missing_data.sort_values(
        by="Percentage of Missing Values", ascending=False
    )
    return missing_data
analyze_null_values(Transactions)
| Null Values | Percentage of Missing Values | Total Values | |
|---|---|---|---|
| online_order | 360 | 1.832994 | 19640 |
| brand | 197 | 0.994799 | 19803 |
| product_line | 197 | 0.994799 | 19803 |
| product_class | 197 | 0.994799 | 19803 |
| product_size | 197 | 0.994799 | 19803 |
| standard_cost | 197 | 0.994799 | 19803 |
| product_first_sold_date | 197 | 0.994799 | 19803 |
| transaction_id | 0 | 0.000000 | 20000 |
| product_id | 0 | 0.000000 | 20000 |
| customer_id | 0 | 0.000000 | 20000 |
| transaction_date | 0 | 0.000000 | 20000 |
| order_status | 0 | 0.000000 | 20000 |
| list_price | 0 | 0.000000 | 20000 |
# Show the number of duplicated rows
Transactions.duplicated().sum()
0
Date Time Adjustments¶
- `transaction_date` should be converted to datetime format
- `product_first_sold_date` should be converted to datetime format
- We will later make a new `product_age` column by converting `product_first_sold_date` to datetime and subtracting it from `today_date`
# Define start and end date of the dataset
start_date = pd.to_datetime("2017-01-01")
end_date = pd.to_datetime("2017-12-31")
# Convert transaction_date column to standard datetime format
Transactions["transaction_date"] = pd.to_datetime(Transactions["transaction_date"])
# Convert product_first_sold_date column to standard datetime format
# We need to add the timedelta to the date because the date is stored as a number of days since 1900-01-01
# Assuming the dataset was given to us on 2018-01-01
today_date = pd.Timestamp("2018-01-01")
Transactions["product_first_sold_date"] = pd.to_timedelta(
Transactions["product_first_sold_date"], unit="D"
) + pd.Timestamp("1900-01-01")
date_difference = pd.Timestamp.today() - today_date
Transactions["product_first_sold_date"] = (
Transactions["product_first_sold_date"] - date_difference
)
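One caveat worth flagging: Excel serial dates are conventionally decoded with origin 1899-12-30 (Excel's day zero, offset to absorb its 1900 leap-year quirk) rather than 1900-01-01, and pandas supports this directly through `pd.to_datetime`'s `origin` parameter. A sketch using serial values from the `head()` output above:

```python
import pandas as pd

# Excel stores dates as serial day counts; 1899-12-30 is the usual epoch
serials = pd.Series([41245.0, 41701.0])
dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(dates.iloc[0])  # → 2012-12-02 00:00:00
```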
Transactions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   transaction_id           20000 non-null  int64
 1   product_id               20000 non-null  int64
 2   customer_id              20000 non-null  int64
 3   transaction_date         20000 non-null  datetime64[ns]
 4   online_order             19640 non-null  float64
 5   order_status             20000 non-null  object
 6   brand                    19803 non-null  object
 7   product_line             19803 non-null  object
 8   product_class            19803 non-null  object
 9   product_size             19803 non-null  object
 10  list_price               20000 non-null  float64
 11  standard_cost            19803 non-null  float64
 12  product_first_sold_date  19803 non-null  datetime64[ns]
dtypes: datetime64[ns](2), float64(3), int64(3), object(5)
memory usage: 2.0+ MB
Fixing the Transactions Data¶
- We already fixed the date time format earlier
- We will convert the `online_order` and `order_status` columns to object type and map them as True/False
- We keep their datatype as object, even though the values are boolean, because it'll be helpful for machine learning models
- Don't worry about the `NaN` values, pipelines will take care of them
# Mapping the online_order column to True and False
Transactions["online_order"] = Transactions["online_order"].map({1: True, 0: False})
# Mapping the order_status column to True and False
Transactions["order_status"] = Transactions["order_status"].map(
{"Approved": True, "Cancelled": False}
)
# Print unique values of online_order and order_status columns
print("Unique values of online_order column:", Transactions["online_order"].unique())
print("Unique values of order_status column:", Transactions["order_status"].unique())
Unique values of online_order column: [False True nan]
Unique values of order_status column: [ True False]
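As a small illustration of what "pipelines will take care of them" could mean for the remaining `NaN` flags, here is a pandas-only sketch of most-frequent-value imputation (illustrative only; this is not the notebook's actual pipeline):

```python
import pandas as pd

# Fill missing boolean flags with the most frequent value (the mode)
flags = pd.Series([True, False, None, True], dtype="object")
filled = flags.fillna(flags.mode()[0])
print(filled.tolist())  # → [True, False, True, True]
```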
Shape and Cardinality¶
- Shape of the data is 4000 rows and 13 columns
- `owns_car` column should be converted to boolean
- `deceased_indicator` column should be converted to boolean
- `gender` column should be made consistent
- `default` column should be dropped as it has no legible data
CustomerDemographic.head(5)
| customer_id | first_name | last_name | gender | past_3_years_bike_related_purchases | DOB | job_title | job_industry_category | wealth_segment | deceased_indicator | default | owns_car | tenure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Laraine | Medendorp | F | 93 | 1953-10-12 00:00:00 | Executive Secretary | Health | Mass Customer | N | "' | Yes | 11.0 |
| 1 | 2 | Eli | Bockman | Male | 81 | 1980-12-16 00:00:00 | Administrative Officer | Financial Services | Mass Customer | N | <script>alert('hi')</script> | Yes | 16.0 |
| 2 | 3 | Arlin | Dearle | Male | 61 | 1954-01-20 00:00:00 | Recruiting Manager | Property | Mass Customer | N | 2018-02-01 00:00:00 | Yes | 15.0 |
| 3 | 4 | Talbot | NaN | Male | 33 | 1961-10-03 00:00:00 | NaN | IT | Mass Customer | N | () { _; } >_[$($())] { touch /tmp/blns.shellsh... | No | 7.0 |
| 4 | 5 | Sheila-kathryn | Calton | Female | 56 | 1977-05-13 00:00:00 | Senior Editor | NaN | Affluent Customer | N | NIL | Yes | 8.0 |
CustomerDemographic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 13 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   customer_id                          4000 non-null   int64
 1   first_name                           4000 non-null   object
 2   last_name                            3875 non-null   object
 3   gender                               4000 non-null   object
 4   past_3_years_bike_related_purchases  4000 non-null   int64
 5   DOB                                  3913 non-null   object
 6   job_title                            3494 non-null   object
 7   job_industry_category                3344 non-null   object
 8   wealth_segment                       4000 non-null   object
 9   deceased_indicator                   4000 non-null   object
 10  default                              3698 non-null   object
 11  owns_car                             4000 non-null   object
 12  tenure                               3913 non-null   float64
dtypes: float64(1), int64(2), object(10)
memory usage: 406.4+ KB
# List the cardinality of all the columns
# Cardinality is the number of unique values in a column
# This is a good way to spot categorical columns
for col in CustomerDemographic.columns:
    print(f"{col}: {CustomerDemographic[col].nunique()}")
print("------------------------------------")
# Print the unique values of every column with cardinality below 10
for col in CustomerDemographic.columns:
    if CustomerDemographic[col].nunique() < 10:
        print(f"{col}: {CustomerDemographic[col].unique()}")
customer_id: 4000
first_name: 3139
last_name: 3725
gender: 6
past_3_years_bike_related_purchases: 100
DOB: 3448
job_title: 195
job_industry_category: 9
wealth_segment: 3
deceased_indicator: 2
default: 90
owns_car: 2
tenure: 22
------------------------------------
gender: ['F' 'Male' 'Female' 'U' 'Femal' 'M']
job_industry_category: ['Health' 'Financial Services' 'Property' 'IT' nan 'Retail' 'Argiculture' 'Manufacturing' 'Telecommunications' 'Entertainment']
wealth_segment: ['Mass Customer' 'Affluent Customer' 'High Net Worth']
deceased_indicator: ['N' 'Y']
owns_car: ['Yes' 'No']
Null Values and Duplicate Rows¶
- `job_title` has 506 missing values
- `job_industry_category` has 656 missing values
- `last_name` has 125 missing values; this whole column can be dropped since it won't be useful for our analysis
- `default` has 302 missing values; this whole column can be dropped since it has no legible data
- `tenure` has 87 missing values
- `DOB` has 87 missing values
- There are no duplicate rows
# Call the function to analyze the missing data
analyze_null_values(CustomerDemographic)
| Null Values | Percentage of Missing Values | Total Values | |
|---|---|---|---|
| job_industry_category | 656 | 19.617225 | 3344 |
| job_title | 506 | 14.481969 | 3494 |
| default | 302 | 8.166577 | 3698 |
| last_name | 125 | 3.225806 | 3875 |
| DOB | 87 | 2.223358 | 3913 |
| tenure | 87 | 2.223358 | 3913 |
| customer_id | 0 | 0.000000 | 4000 |
| first_name | 0 | 0.000000 | 4000 |
| gender | 0 | 0.000000 | 4000 |
| past_3_years_bike_related_purchases | 0 | 0.000000 | 4000 |
| wealth_segment | 0 | 0.000000 | 4000 |
| deceased_indicator | 0 | 0.000000 | 4000 |
| owns_car | 0 | 0.000000 | 4000 |
# Show the number of duplicated rows
CustomerDemographic.duplicated().sum()
0
Date Time Adjustments¶
- `DOB` should be converted to datetime format
- `tenure` should be converted to integer format
# Convert DOB column to standard datetime format
CustomerDemographic["DOB"] = pd.to_datetime(CustomerDemographic["DOB"])
CustomerDemographic["tenure"] = CustomerDemographic["tenure"].astype("Int64")
Fixing the Customer Demographic Data¶
- We already fixed the `DOB` and `tenure` columns earlier
- We will convert the `owns_car` and `deceased_indicator` columns to object type and map them as True/False
- We keep their datatype as object, even though the values are boolean, because it'll be helpful for machine learning models
- We will map the `gender` column to M and F to make it consistent
# Converting the owns_car column to True and False
CustomerDemographic["owns_car"] = CustomerDemographic["owns_car"].map(
    {"Yes": True, "No": False}
)
CustomerDemographic["owns_car"] = CustomerDemographic["owns_car"].astype("object")
# Converting the deceased_indicator column to True and False
CustomerDemographic["deceased_indicator"] = CustomerDemographic[
    "deceased_indicator"
].map({"Y": True, "N": False})
CustomerDemographic["deceased_indicator"] = CustomerDemographic["deceased_indicator"].astype("object")
# Normalise the gender column to M and F; replace U with np.nan
CustomerDemographic["gender"] = CustomerDemographic["gender"].replace("Female", "F")
CustomerDemographic["gender"] = CustomerDemographic["gender"].replace("Femal", "F")
CustomerDemographic["gender"] = CustomerDemographic["gender"].replace("Male", "M")
CustomerDemographic["gender"] = CustomerDemographic["gender"].replace("U", np.nan)
# Drop the default column
CustomerDemographic.drop(columns="default", inplace=True)
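The chain of `replace` calls above can also be written as a single mapping dict, which makes the normalisation rules easier to audit in one place; a sketch:

```python
import numpy as np
import pandas as pd

# One dict capturing every observed spelling of gender
gender_map = {"F": "F", "Female": "F", "Femal": "F", "M": "M", "Male": "M", "U": np.nan}
sample = pd.Series(["F", "Male", "Femal", "U"])
print(sample.map(gender_map).tolist())  # → ['F', 'M', 'F', nan]
```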
Shape and Cardinality¶
- Shape of the data is 3999 rows and 6 columns
- `state` column needs to be mapped to state initials
CustomerAddress.head(5)
| customer_id | address | postcode | state | country | property_valuation | |
|---|---|---|---|---|---|---|
| 0 | 1 | 060 Morning Avenue | 2016 | New South Wales | Australia | 10 |
| 1 | 2 | 6 Meadow Vale Court | 2153 | New South Wales | Australia | 10 |
| 2 | 4 | 0 Holy Cross Court | 4211 | QLD | Australia | 9 |
| 3 | 5 | 17979 Del Mar Point | 2448 | New South Wales | Australia | 4 |
| 4 | 6 | 9 Oakridge Court | 3216 | VIC | Australia | 9 |
CustomerAddress.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3999 entries, 0 to 3998
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   customer_id         3999 non-null   int64
 1   address             3999 non-null   object
 2   postcode            3999 non-null   int64
 3   state               3999 non-null   object
 4   country             3999 non-null   object
 5   property_valuation  3999 non-null   int64
dtypes: int64(3), object(3)
memory usage: 187.6+ KB
# Cardinality of the columns
for col in CustomerAddress.columns:
    print(f"{col}: {CustomerAddress[col].nunique()}")
print("------------------------------------")
# Print the unique values of every column with cardinality below 10
for col in CustomerAddress.columns:
    if CustomerAddress[col].nunique() < 10:
        print(f"{col}: {CustomerAddress[col].unique()}")
customer_id: 3999
address: 3996
postcode: 873
state: 5
country: 1
property_valuation: 12
------------------------------------
state: ['New South Wales' 'QLD' 'VIC' 'NSW' 'Victoria']
country: ['Australia']
Missing Values and Duplicate Rows¶
- no missing values
- no duplicate rows
analyze_null_values(CustomerAddress)
| Null Values | Percentage of Missing Values | Total Values | |
|---|---|---|---|
| customer_id | 0 | 0.0 | 3999 |
| address | 0 | 0.0 | 3999 |
| postcode | 0 | 0.0 | 3999 |
| state | 0 | 0.0 | 3999 |
| country | 0 | 0.0 | 3999 |
| property_valuation | 0 | 0.0 | 3999 |
# Print the number of duplicated rows
CustomerAddress.duplicated().sum()
0
Fixing the Customer Address Data¶
- There are no date-time adjustments to be made
- We will map the `state` column to state initials
# Mapping New South Wales to NSW and Victoria to VIC
CustomerAddress["state"] = CustomerAddress["state"].replace("New South Wales", "NSW")
CustomerAddress["state"] = CustomerAddress["state"].replace("Victoria", "VIC")
Shape and Cardinality¶
- Shape of the data is 1000 rows and 23 columns
- `owns_car` column should be converted to boolean
- `deceased_indicator` column should be converted to boolean
- `gender` column should be made consistent
- The `Unnamed` columns should be dropped as they have no legible data
# Set max column display to None
pd.set_option("display.max_columns", None)
NewCustomerList.sample(5)
| first_name | last_name | gender | past_3_years_bike_related_purchases | DOB | job_title | job_industry_category | wealth_segment | deceased_indicator | owns_car | tenure | address | postcode | state | country | property_valuation | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Rank | Value | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 505 | Milty | Brauninger | Male | 13 | 1945-07-10 | Payment Adjustment Coordinator | NaN | High Net Worth | N | No | 11 | 54 6th Trail | 2640 | NSW | Australia | 7 | 0.69 | 0.69 | 0.690 | 0.69000 | 504 | 504 | 0.850000 |
| 647 | Aldin | Newsome | Male | 24 | 1981-07-06 | Financial Analyst | Financial Services | Mass Customer | N | No | 17 | 058 Morningstar Center | 2127 | NSW | Australia | 9 | 0.90 | 0.90 | 1.125 | 0.95625 | 648 | 648 | 0.725000 |
| 216 | Rosalinde | Cubuzzi | Female | 50 | 1997-09-04 | Business Systems Development Analyst | NaN | Mass Customer | N | No | 5 | 6 Lotheville Trail | 2444 | NSW | Australia | 7 | 0.75 | 0.75 | 0.750 | 0.63750 | 215 | 215 | 1.128906 |
| 235 | Rolland | Esmead | Male | 41 | 1940-07-13 | NaN | Health | Affluent Customer | N | No | 8 | 72008 7th Avenue | 2200 | NSW | Australia | 8 | 0.93 | 0.93 | 0.930 | 0.93000 | 233 | 233 | 1.100000 |
| 342 | Tannie | Petrakov | Male | 84 | 1951-11-27 | Data Coordiator | IT | Affluent Customer | N | No | 10 | 691 Valley Edge Alley | 4078 | QLD | Australia | 6 | 0.46 | 0.46 | 0.575 | 0.57500 | 341 | 341 | 0.990000 |
NewCustomerList.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 23 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   first_name                           1000 non-null   object
 1   last_name                            971 non-null    object
 2   gender                               1000 non-null   object
 3   past_3_years_bike_related_purchases  1000 non-null   int64
 4   DOB                                  983 non-null    object
 5   job_title                            894 non-null    object
 6   job_industry_category                835 non-null    object
 7   wealth_segment                       1000 non-null   object
 8   deceased_indicator                   1000 non-null   object
 9   owns_car                             1000 non-null   object
 10  tenure                               1000 non-null   int64
 11  address                              1000 non-null   object
 12  postcode                             1000 non-null   int64
 13  state                                1000 non-null   object
 14  country                              1000 non-null   object
 15  property_valuation                   1000 non-null   int64
 16  Unnamed: 16                          1000 non-null   float64
 17  Unnamed: 17                          1000 non-null   float64
 18  Unnamed: 18                          1000 non-null   float64
 19  Unnamed: 19                          1000 non-null   float64
 20  Unnamed: 20                          1000 non-null   int64
 21  Rank                                 1000 non-null   int64
 22  Value                                1000 non-null   float64
dtypes: float64(5), int64(6), object(12)
memory usage: 179.8+ KB
# Cardinality of the columns
for col in NewCustomerList.columns:
    print(f"{col}: {NewCustomerList[col].nunique()}")
print("------------------------------------")
# Print the unique values of every column with cardinality below 10
for col in NewCustomerList.columns:
    if NewCustomerList[col].nunique() < 10:
        print(f"{col}: {NewCustomerList[col].unique()}")
first_name: 940
last_name: 961
gender: 3
past_3_years_bike_related_purchases: 100
DOB: 961
job_title: 184
job_industry_category: 9
wealth_segment: 3
deceased_indicator: 1
owns_car: 2
tenure: 23
address: 1000
postcode: 522
state: 3
country: 1
property_valuation: 12
Unnamed: 16: 71
Unnamed: 17: 132
Unnamed: 18: 183
Unnamed: 19: 321
Unnamed: 20: 324
Rank: 324
Value: 324
------------------------------------
gender: ['Male' 'Female' 'U']
job_industry_category: ['Manufacturing' 'Property' 'Financial Services' 'Entertainment' 'Retail' 'IT' 'Telecommunications' 'Health' nan 'Argiculture']
wealth_segment: ['Mass Customer' 'Affluent Customer' 'High Net Worth']
deceased_indicator: ['N']
owns_car: ['Yes' 'No']
state: ['QLD' 'NSW' 'VIC']
country: ['Australia']
Missing Values and Duplicate Rows¶
- `job_title` has 106 missing values
- `job_industry_category` has 165 missing values
- `last_name` has 29 missing values; this whole column can be dropped since it won't be useful for our analysis
- `DOB` has 17 missing values
- There are no duplicate rows
# Call the function to analyze the missing data
analyze_null_values(NewCustomerList)
| Null Values | Percentage of Missing Values | Total Values | |
|---|---|---|---|
| job_industry_category | 165 | 19.760479 | 835 |
| job_title | 106 | 11.856823 | 894 |
| last_name | 29 | 2.986612 | 971 |
| DOB | 17 | 1.729400 | 983 |
| first_name | 0 | 0.000000 | 1000 |
| country | 0 | 0.000000 | 1000 |
| Rank | 0 | 0.000000 | 1000 |
| Unnamed: 20 | 0 | 0.000000 | 1000 |
| Unnamed: 19 | 0 | 0.000000 | 1000 |
| Unnamed: 18 | 0 | 0.000000 | 1000 |
| Unnamed: 17 | 0 | 0.000000 | 1000 |
| Unnamed: 16 | 0 | 0.000000 | 1000 |
| property_valuation | 0 | 0.000000 | 1000 |
| address | 0 | 0.000000 | 1000 |
| state | 0 | 0.000000 | 1000 |
| postcode | 0 | 0.000000 | 1000 |
| tenure | 0 | 0.000000 | 1000 |
| owns_car | 0 | 0.000000 | 1000 |
| deceased_indicator | 0 | 0.000000 | 1000 |
| wealth_segment | 0 | 0.000000 | 1000 |
| past_3_years_bike_related_purchases | 0 | 0.000000 | 1000 |
| gender | 0 | 0.000000 | 1000 |
| Value | 0 | 0.000000 | 1000 |
# Check for duplicated rows
NewCustomerList.duplicated().sum()
0
Date Time Adjustments¶
- `DOB` should be converted to datetime format
# Convert DOB column to standard datetime format
NewCustomerList["DOB"] = pd.to_datetime(NewCustomerList["DOB"])
Fixing the New Customer List Data¶
- We already fixed the `DOB` column earlier
- We will convert the `owns_car` and `deceased_indicator` columns to object type and map them as True/False
- We will map the `gender` column to M and F to make it consistent
- We will drop the `Unnamed` columns
# Fixing the owns_car column
NewCustomerList["owns_car"] = NewCustomerList["owns_car"].replace("Yes", True)
NewCustomerList["owns_car"] = NewCustomerList["owns_car"].replace("No", False)
NewCustomerList["owns_car"] = NewCustomerList["owns_car"].astype("object")
# Fixing deceased_indicator column
NewCustomerList["deceased_indicator"] = NewCustomerList["deceased_indicator"].replace("N", False)
NewCustomerList["deceased_indicator"] = NewCustomerList["deceased_indicator"].astype("object")
# Fixing gender column
NewCustomerList["gender"] = NewCustomerList["gender"].replace("Female", "F")
NewCustomerList["gender"] = NewCustomerList["gender"].replace("Male", "M")
NewCustomerList["gender"] = NewCustomerList["gender"].replace("U", np.nan)
# Dropping all unnamed columns
NewCustomerList.drop(columns="Unnamed: 16", inplace=True)
NewCustomerList.drop(columns="Unnamed: 17", inplace=True)
NewCustomerList.drop(columns="Unnamed: 18", inplace=True)
NewCustomerList.drop(columns="Unnamed: 19", inplace=True)
NewCustomerList.drop(columns="Unnamed: 20", inplace=True)
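The five separate drops above can be collapsed into one call by selecting the columns programmatically; a sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"first_name": ["Milty"], "Unnamed: 16": [0.69], "Unnamed: 20": [504]})
# Collect every column whose name starts with "Unnamed" and drop them in one go
unnamed_cols = [c for c in toy.columns if c.startswith("Unnamed")]
toy = toy.drop(columns=unnamed_cols)
print(toy.columns.tolist())  # → ['first_name']
```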
NewCustomerList.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   first_name                           1000 non-null   object
 1   last_name                            971 non-null    object
 2   gender                               983 non-null    object
 3   past_3_years_bike_related_purchases  1000 non-null   int64
 4   DOB                                  983 non-null    datetime64[ns]
 5   job_title                            894 non-null    object
 6   job_industry_category                835 non-null    object
 7   wealth_segment                       1000 non-null   object
 8   deceased_indicator                   1000 non-null   object
 9   owns_car                             1000 non-null   object
 10  tenure                               1000 non-null   int64
 11  address                              1000 non-null   object
 12  postcode                             1000 non-null   int64
 13  state                                1000 non-null   object
 14  country                              1000 non-null   object
 15  property_valuation                   1000 non-null   int64
 16  Rank                                 1000 non-null   int64
 17  Value                                1000 non-null   float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(11)
memory usage: 140.8+ KB
Task 2 : Data Insights¶
Now that we have addressed the data quality issues, we can use the data to derive insights.
- We will merge the dataframes into a master dataframe
- We will find something to do with the postal codes and addresses
- We will do an RFM analysis and make customer clusters
- We will use geopandas to draw customer location heatmaps
- We will use univariate and multivariate analysis to find insights
Merging the Dataframes¶
We will merge the three dataframes into a master dataframe, using the `customer_id` column as the key.
Merging Strategy¶
- Looks like `customer_id` is the common column in all three training dataframes
- NewCustomerList has no `customer_id` column since they are new customers
- We will use `customer_id` as the primary key to merge the dataframes
- We will use an inner join so the master dataframe only keeps customers present in all three datasets
# Display all the common columns in all the datasets
common_columns = list(
set(Transactions.columns)
& set(CustomerDemographic.columns)
& set(CustomerAddress.columns)
)
print(common_columns)
['customer_id']
# Inner join all the datasets
_ = pd.merge(
Transactions,
CustomerDemographic,
how="inner",
left_on="customer_id",
right_on="customer_id",
)
MasterDataset = pd.merge(
_,
CustomerAddress,
how="inner",
left_on="customer_id",
right_on="customer_id",
)
MasterDataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19968 entries, 0 to 19967
Data columns (total 29 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   transaction_id                       19968 non-null  int64
 1   product_id                           19968 non-null  int64
 2   customer_id                          19968 non-null  int64
 3   transaction_date                     19968 non-null  datetime64[ns]
 4   online_order                         19609 non-null  object
 5   order_status                         19968 non-null  bool
 6   brand                                19773 non-null  object
 7   product_line                         19773 non-null  object
 8   product_class                        19773 non-null  object
 9   product_size                         19773 non-null  object
 10  list_price                           19968 non-null  float64
 11  standard_cost                        19773 non-null  float64
 12  product_first_sold_date              19773 non-null  datetime64[ns]
 13  first_name                           19968 non-null  object
 14  last_name                            19326 non-null  object
 15  gender                               19513 non-null  object
 16  past_3_years_bike_related_purchases  19968 non-null  int64
 17  DOB                                  19522 non-null  datetime64[ns]
 18  job_title                            17589 non-null  object
 19  job_industry_category                16746 non-null  object
 20  wealth_segment                       19968 non-null  object
 21  deceased_indicator                   19968 non-null  object
 22  owns_car                             19968 non-null  object
 23  tenure                               19522 non-null  Int64
 24  address                              19968 non-null  object
 25  postcode                             19968 non-null  int64
 26  state                                19968 non-null  object
 27  country                              19968 non-null  object
 28  property_valuation                   19968 non-null  int64
dtypes: Int64(1), bool(1), datetime64[ns](3), float64(2), int64(6), object(16)
memory usage: 4.3+ MB
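When merging on a key, `pd.merge` can also guard against accidental row duplication via its `validate` parameter, and `indicator=True` reports where each row came from. A small sketch on toy data (illustrative, not part of the original notebook):

```python
import pandas as pd

left = pd.DataFrame({"customer_id": [1, 1, 2], "list_price": [10.0, 20.0, 15.0]})
right = pd.DataFrame({"customer_id": [1, 2], "state": ["NSW", "VIC"]})

# validate="m:1" raises MergeError if customer_id is not unique on the right;
# indicator=True adds a _merge column showing each row's provenance
merged = pd.merge(left, right, on="customer_id", how="inner",
                  validate="m:1", indicator=True)
print(merged.shape)  # → (3, 4)
```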
Important Note¶
Int64 and int64 are different datatypes.
I received a peculiar error with this same code when I used Int64 instead of int64; it turns out some seaborn functions cannot handle nullable data types.
Int64: a nullable integer data type introduced in pandas 0.24.0. It can represent integer data alongside missing values (NA), so a column with missing entries can stay integer-typed.
int64: the regular, non-nullable integer data type provided by NumPy, the library underlying pandas. An int64 column cannot contain NaN values.
So we will convert all the int64 columns to Int64, for consistency across the merged dataframe, keeping the seaborn caveat in mind.
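The difference is easy to demonstrate: a plain list containing a missing value silently becomes float64, while the nullable dtype keeps integers intact. A quick sketch:

```python
import pandas as pd

upcast = pd.Series([11, 16, None])                    # None forces an upcast to float64
nullable = pd.Series([11, 16, None], dtype="Int64")   # stays integer, holds <NA>
print(upcast.dtype, nullable.dtype)  # → float64 Int64
```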
# Check if the datatype is int64 if it is, convert it to Int64
for col in MasterDataset.columns:
if MasterDataset[col].dtype == "int64":
MasterDataset[col] = MasterDataset[col].astype("Int64")
MasterDataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19968 entries, 0 to 19967
Data columns (total 29 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   transaction_id                       19968 non-null  Int64
 1   product_id                           19968 non-null  Int64
 2   customer_id                          19968 non-null  Int64
 3   transaction_date                     19968 non-null  datetime64[ns]
 4   online_order                         19609 non-null  object
 5   order_status                         19968 non-null  bool
 6   brand                                19773 non-null  object
 7   product_line                         19773 non-null  object
 8   product_class                        19773 non-null  object
 9   product_size                         19773 non-null  object
 10  list_price                           19968 non-null  float64
 11  standard_cost                        19773 non-null  float64
 12  product_first_sold_date              19773 non-null  datetime64[ns]
 13  first_name                           19968 non-null  object
 14  last_name                            19326 non-null  object
 15  gender                               19513 non-null  object
 16  past_3_years_bike_related_purchases  19968 non-null  Int64
 17  DOB                                  19522 non-null  datetime64[ns]
 18  job_title                            17589 non-null  object
 19  job_industry_category                16746 non-null  object
 20  wealth_segment                       19968 non-null  object
 21  deceased_indicator                   19968 non-null  object
 22  owns_car                             19968 non-null  object
 23  tenure                               19522 non-null  Int64
 24  address                              19968 non-null  object
 25  postcode                             19968 non-null  Int64
 26  state                                19968 non-null  object
 27  country                              19968 non-null  object
 28  property_valuation                   19968 non-null  Int64
dtypes: Int64(7), bool(1), datetime64[ns](3), float64(2), object(16)
memory usage: 4.4+ MB
Building the Postcoded Dataframe¶
- We would have used the Australian Bureau of Statistics to get the postal codes and their corresponding states, but they don't have a free API
- So we will use a CSV file instead
- We made a `Postcoded` dataframe with the postal codes and fixed the state column, so it is now consistent with the postal codes
- We will later use this with Geopandas to make a heatmap of the customer locations
# Import the Australian Postcodes dataset
Postcodes = pd.read_csv(
"/home/meow/Desktop/internship-speedrun/kpmg/australian_postcodes.csv"
)
# Replace 0 with np.nan
Postcodes.replace(0, np.nan, inplace=True)
# We only need the postcode, longitude and latitude columns
# We also need the state column which will make the state consistent with postcodes in the MasterDataset
Postcodes = Postcodes[["postcode", "lat", "long", "state"]]
# Drop all the rows with missing values
Postcodes.dropna(inplace=True)
# Drop all the duplicate rows
Postcodes.drop_duplicates(inplace=True)
# Drop all the non unique postcode rows
Postcodes.drop_duplicates(subset="postcode", inplace=True)
Postcodes.head()
| postcode | lat | long | state | |
|---|---|---|---|---|
| 0 | 200 | -35.277700 | 149.119000 | ACT |
| 2 | 800 | -12.458684 | 130.836680 | NT |
| 4 | 801 | -12.458684 | 130.836680 | NT |
| 6 | 804 | -12.428017 | 130.873315 | NT |
| 7 | 810 | -12.381806 | 130.866242 | NT |
Postcoded = MasterDataset.copy()
Postcoded = pd.merge(
    Postcoded,
    Postcodes,
    how="inner",
    on="postcode",
)
print(Postcoded.shape)
print(MasterDataset.shape)
(19968, 32) (19968, 29)
# state_y comes from the postcode lookup, where state is derived from the postcode itself,
# so it is more reliable than state_x, which we will drop
Postcoded.drop(columns="state_x", inplace=True)
Postcoded.rename(columns={"state_y": "state"}, inplace=True)
Postcoded.sample(5)
| transaction_id | product_id | customer_id | transaction_date | online_order | order_status | brand | product_line | product_class | product_size | list_price | standard_cost | product_first_sold_date | first_name | last_name | gender | past_3_years_bike_related_purchases | DOB | job_title | job_industry_category | wealth_segment | deceased_indicator | owns_car | tenure | address | postcode | country | property_valuation | lat | long | state | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6125 | 8823 | 42 | 2540 | 2017-10-28 | True | True | OHM Cycles | Road | medium | small | 1810.00 | 1610.90 | 2005-05-20 09:34:07.879592 | Donavon | NaN | NaN | 14 | NaT | Executive Secretary | IT | Mass Customer | False | True | <NA> | 49 Sheridan Circle | 2560 | Australia | 7 | -34.194216 | 150.768408 | NSW |
| 9055 | 15333 | 7 | 1338 | 2017-05-13 | False | True | Giant Bicycles | Standard | medium | small | 1311.44 | 1167.18 | 1986-10-23 09:34:07.879592 | Giuseppe | Tezure | M | 53 | 1980-11-24 | Product Engineer | NaN | High Net Worth | False | False | 14 | 02885 Buhler Court | 3804 | Australia | 9 | -37.980059 | 145.325994 | VIC |
| 19644 | 14493 | 83 | 2408 | 2017-08-13 | False | True | Solex | Touring | medium | large | 2083.94 | 675.03 | 2007-09-28 09:34:07.879592 | Celestia | NaN | F | 38 | 1955-10-10 | VP Marketing | Health | Mass Customer | False | False | 16 | 9 Hollow Ridge Circle | 2500 | Australia | 8 | -34.423171 | 150.874294 | NSW |
| 18464 | 20000 | 11 | 1144 | 2017-09-22 | True | True | Trek Bicycles | Standard | medium | small | 1775.81 | 1580.47 | 1993-07-04 09:34:07.879592 | Damon | Phelip | M | 73 | 1996-12-12 | Environmental Specialist | Health | High Net Worth | False | True | 5 | 4439 7th Terrace | 2177 | Australia | 9 | -33.895053 | 150.879379 | NSW |
| 14983 | 19410 | 79 | 1771 | 2017-01-15 | False | False | Norco Bicycles | Standard | medium | medium | 1555.58 | 818.01 | 2009-08-21 09:34:07.879592 | Zarah | Santello | F | 1 | 1974-11-01 | Geological Engineer | Manufacturing | Mass Customer | False | True | 14 | 410 Hoepker Pass | 4123 | Australia | 5 | -27.583528 | 153.116151 | QLD |
Feature Engineering¶
For deriving insights, we will make a few new columns in our master dataframe. These will be :
- `customer_age`: age of the customer (only customers aged between 14 and 100 are kept)
- `age_group`: age group of the customer
- `product_age`: age of the product
- `profit`: profit of the transaction
- `recency`: days since the transaction
- `frequency`: number of transactions per customer
- `transaction_month`, `transaction_week`, `transaction_day`: when the transaction happened
# Calculate the age of the customers
Postcoded["customer_age"] = (end_date - Postcoded["DOB"]).dt.days // 365.25
Postcoded["customer_age"] = Postcoded["customer_age"].astype("Int64")
# Keeping only the values between 14 and 100
Postcoded = Postcoded[
(Postcoded["customer_age"] >= 14) & (Postcoded["customer_age"] <= 100)
]
Postcoded.customer_age.describe()
count 19513.0 mean 39.855122 std 12.59898 min 15.0 25% 30.0 50% 40.0 75% 49.0 max 86.0 Name: customer_age, dtype: Float64
# Calculate the age of the products
Postcoded["product_age"] = (end_date - Postcoded["product_first_sold_date"]).dt.days // 365.25
# Calculate age group
age_group = pd.cut(
Postcoded["customer_age"],
bins=[14, 30, 50, 70, 90],
labels=["14-30", "31-50", "51-70", "71-90"],
)
Postcoded["age_group"] = age_group
# Calculate profit
Postcoded["profit"] = Postcoded["list_price"] - Postcoded["standard_cost"]
# Calculate recency
Postcoded["recency"] = (today_date - Postcoded["transaction_date"]).dt.days.astype("Int64")
# Calculate frequency
Postcoded["frequency"] = Postcoded.groupby("customer_id")["customer_id"].transform(
"count"
)
# Calculate the Transaction Day, Week , Month
Postcoded['transaction_day'] = Postcoded['transaction_date'].dt.strftime('%A')
Postcoded['transaction_week'] = Postcoded['transaction_date'].dt.strftime('%W')
Postcoded['transaction_month'] = Postcoded['transaction_date'].dt.strftime('%B')
Postcoded.info()
<class 'pandas.core.frame.DataFrame'> Index: 19513 entries, 0 to 19967 Data columns (total 40 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 transaction_id 19513 non-null Int64 1 product_id 19513 non-null Int64 2 customer_id 19513 non-null Int64 3 transaction_date 19513 non-null datetime64[ns] 4 online_order 19162 non-null object 5 order_status 19513 non-null bool 6 brand 19327 non-null object 7 product_line 19327 non-null object 8 product_class 19327 non-null object 9 product_size 19327 non-null object 10 list_price 19513 non-null float64 11 standard_cost 19327 non-null float64 12 product_first_sold_date 19327 non-null datetime64[ns] 13 first_name 19513 non-null object 14 last_name 18887 non-null object 15 gender 19513 non-null object 16 past_3_years_bike_related_purchases 19513 non-null Int64 17 DOB 19513 non-null datetime64[ns] 18 job_title 17173 non-null object 19 job_industry_category 16291 non-null object 20 wealth_segment 19513 non-null object 21 deceased_indicator 19513 non-null object 22 owns_car 19513 non-null object 23 tenure 19513 non-null Int64 24 address 19513 non-null object 25 postcode 19513 non-null Int64 26 country 19513 non-null object 27 property_valuation 19513 non-null Int64 28 lat 19513 non-null float64 29 long 19513 non-null float64 30 state 19513 non-null object 31 customer_age 19513 non-null Int64 32 product_age 19327 non-null float64 33 age_group 19513 non-null category 34 profit 19327 non-null float64 35 recency 19513 non-null Int64 36 frequency 19513 non-null Int64 37 transaction_day 19513 non-null object 38 transaction_week 19513 non-null object 39 transaction_month 19513 non-null object dtypes: Int64(10), bool(1), category(1), datetime64[ns](3), float64(6), object(19) memory usage: 6.0+ MB
Univariate Analysis¶
We will use seaborn to make some plots and find insights, as done here.
If we find any outliers, we will take a deeper look to understand why they exist and how they affect the distribution.
Since this is a univariate analysis, the null values are of no use to us, so we will drop them.
We will make a copy of the Postcoded dataframe as MasterClean and drop the null values from it.
Highlights from the Univariate Analysis¶
- The price distribution looks normally distributed with no outliers, and a mean of 1110 USD
- The profit distribution is right-skewed, with a mean of 550 USD
- Age is distributed between 20 and 60, with most customers between 30 and 45; makes sense for them to invest in a bicycle
- Past-3-years bike purchases are also normally distributed, with a mean of 49 purchases
- The age segment that buys the most is 31-50, and they also bring in the most profit
- Most purchases were made on Mondays and Wednesdays; of course buying a bicycle will be the least of my concerns on a weekend
- August and October are the months with the most sales
- Most sales go to the Mass Customer segment, accounting for about 50% of sales
- Most sales are made in the NSW region, accounting for over 50% of sales
- Women account for slightly more sales, by about a percent
- The industries to target are Manufacturing, then Financial Services and Health
# Converting the Postcoded dataset to a new MasterClean dataset
MasterClean = Postcoded.copy()
# Dropping all the rows with missing values
MasterClean.dropna(inplace=True)
# Product Distribution by Prices
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
plt.figure(figsize=(10, 6))
sns.histplot(data=MasterClean, x='list_price')
plt.title('Product Distribution by Prices')
plt.xlabel('Price in USD')
plt.ylabel('Count')
plt.axvline(MasterClean['list_price'].mean(), color='red', linestyle='--', label='Mean')
plt.show()
print('Mean: ', MasterClean['list_price'].mean())
Mean: 1110.1365079248606
# Product Distribution by Profit
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
plt.figure(figsize=(10, 6))
sns.histplot(data=MasterClean, x='profit')
plt.title('Product Distribution by Profit')
plt.xlabel('Profit in USD')
plt.ylabel('Count')
plt.axvline(MasterClean['profit'].mean(), color='red', linestyle='--', label='Mean')
plt.axvline(MasterClean['profit'].median(), color='green', linestyle='--', label='Median')
plt.show()
print('Mean: ', MasterClean['profit'].mean())
print('Median: ', MasterClean['profit'].median())
Mean: 552.2205679483417 Median: 445.2099999999999
# Customer Distribution by Age
import matplotlib.pyplot as plt
import seaborn as sns
# Convert customer_age to int64 instead of Int64
MasterClean['customer_age'] = MasterClean['customer_age'].astype('int64')
sns.set_style('whitegrid')
plt.figure(figsize=(10, 6))
sns.histplot(data=MasterClean, x='customer_age')
plt.title('Customer Distribution by Age')
plt.xlabel('Age in Years')
plt.ylabel('Count')
plt.axvline(MasterClean['customer_age'].mean(), color='red', linestyle='--', label='Mean')
plt.show()
print('Mean: ', MasterClean['customer_age'].mean())
Mean: 39.8009245670678
# Distribution of Past 3 Years Bike Related Purchases
import matplotlib.pyplot as plt
import seaborn as sns
# Convert past_3_years_bike_related_purchases to int64 instead of Int64
MasterClean['past_3_years_bike_related_purchases'] = MasterClean['past_3_years_bike_related_purchases'].astype('int64')
sns.set_style('whitegrid')
plt.figure(figsize=(10, 6))
sns.histplot(data=MasterClean, x='past_3_years_bike_related_purchases')
plt.title('Distribution of Past 3 Years Bike Related Purchases')
plt.xlabel('Purchases')
plt.ylabel('Count')
plt.axvline(MasterClean['past_3_years_bike_related_purchases'].mean(), color='red', linestyle='--', label='Mean')
plt.show()
print('Mean: ', MasterClean['past_3_years_bike_related_purchases'].mean())
Mean: 49.495450542999706
# Order Distribution by Age Group
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
base_color = sns.color_palette()[0]
counts = MasterClean['age_group'].value_counts(normalize=True)
sns.barplot(x= counts.index, y=counts.values, color=base_color)
plt.xlabel('Age Groups')
plt.ylabel('Percentage')
print(counts * 100)
plt.title('Order Distribution by Age Group')
plt.show()
age_group 31-50 52.355445 14-30 25.476959 51-70 21.969475 71-90 0.198122 Name: proportion, dtype: float64
# Profit Distribution by Age Group
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
base_color = sns.color_palette()[0]
# observed=False keeps all category levels and silences the pandas FutureWarning
profit_by_age = MasterClean.groupby('age_group', observed=False)['profit'].sum().reset_index()
sns.barplot(x='age_group', y='profit', data=profit_by_age, color=base_color)
# Set labels and title
plt.xlabel('Age Group')
plt.ylabel('Profit in Millions USD')
plt.title('Profit by Age Group')
print(profit_by_age.head())
age_group profit 0 14-30 1916582.54 1 31-50 3944529.54 2 51-70 1648972.98 3 71-90 15576.84
# Set the size of the figure
plt.figure(figsize=(12, 6))
# Bar plot for transaction day
plt.subplot(1, 2, 1) # Subplot 1
sns.countplot(data=MasterClean, x='transaction_day', color=base_color)
plt.title('Number of Sales per Transaction Day')
plt.xticks(rotation=90)
plt.xlabel('Transaction Day')
plt.ylabel('Number of Sales')
# Bar plot for transaction month
plt.subplot(1, 2, 2) # Subplot 2
sns.countplot(data=MasterClean, x='transaction_month', color=base_color)
plt.title('Number of Sales per Transaction Month')
plt.xlabel('Transaction Month')
plt.ylabel('Number of Sales')
plt.xticks(rotation=90)
# Adjust layout
plt.tight_layout()
# Show the plots
plt.show()
#Orders distribution by wealth segment
plt.figure(figsize = (7,4), dpi = 100)
counts = MasterClean['wealth_segment'].value_counts(normalize = True)
sns.barplot(x = counts.index, y = counts.values, color=base_color)
plt.xlabel('Wealth segment')
plt.ylabel('Count')
print(counts * 100)
plt.title("Orders distribution by wealth segment - percent")
plt.show()
wealth_segment Mass Customer 49.713824 High Net Worth 25.447608 Affluent Customer 24.838568 Name: proportion, dtype: float64
# Orders distribution by state
plt.figure(figsize = (8,4), dpi = 100)
counts = MasterClean['state'].value_counts(normalize = True)
sns.barplot(x = counts.index, y = counts.values, color=base_color)
plt.xlabel('state')
plt.ylabel('Count')
print(counts * 100)
plt.title("Orders distribution by state - percent")
plt.show()
state NSW 52.839742 VIC 25.036689 QLD 22.050191 ACT 0.073378 Name: proportion, dtype: float64
import matplotlib.pyplot as plt
plt.figure(figsize=(7, 7), dpi=100)
counts = MasterClean['gender'].value_counts(normalize=True)
labels = counts.index
sizes = counts.values
colors = ['lightcoral', 'lightblue'] # You can customize the colors here
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
plt.axis('equal') # Equal aspect ratio ensures the pie chart is circular.
plt.title("Distribution of Customer Orders by Gender")
plt.show()
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 8))
# Plotting the distribution of orders by job industry category
plt.subplot(1, 2, 1) # 1 row, 2 columns, first plot
counts_industry = MasterClean['job_industry_category'].value_counts(normalize=True)
sns.barplot(x=counts_industry.index, y=counts_industry.values, color='b')
plt.xlabel('Job Industry Category')
plt.ylabel('Percentage')
plt.xticks(rotation=90) # Rotate x-axis labels to 90 degrees
plt.title("Distribution of Orders by Job Industry Category")
# Plotting the distribution of top 10 job titles
plt.subplot(1, 2, 2) # 1 row, 2 columns, second plot
top_10_titles = MasterClean['job_title'].value_counts(normalize=True).head(10)
sns.barplot(x=top_10_titles.index, y=top_10_titles.values, color='b')
plt.xlabel('Job Title')
plt.ylabel('Percentage')
plt.xticks(rotation=90) # Rotate x-axis labels to 90 degrees
plt.title("Distribution of Top 10 Job Titles")
plt.tight_layout() # To prevent overlapping labels and titles
plt.show()
Bivariate Analysis¶
In this section we will investigate pairs of variables and check whether there is any relationship between them.
Highlights from the Bivariate Analysis¶
- Among customers who own a car, women buy more bikes; among those who don't, men buy more
- NSW has the highest number of bike buyers, with more female than male buyers
- More women buy bikes online than men do
- The most profitable brand is WeareA2B
These comparisons are usually best left to dashboards, since they take a lot of plotting code in pandas.
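For quick pairwise checks without a dashboard, `pd.crosstab` with `normalize` gives the same comparison numerically. A sketch on toy stand-in data (in the notebook you would pass `MasterClean['gender']` and `MasterClean['owns_car']`):

```python
import pandas as pd

# Toy stand-ins for MasterClean['gender'] and MasterClean['owns_car']
gender = pd.Series(["F", "F", "M", "M", "F", "M"])
owns_car = pd.Series(["Yes", "No", "Yes", "No", "Yes", "No"])

# Row-normalised cross-tabulation: share of car owners within each gender
shares = pd.crosstab(gender, owns_car, normalize="index")
print(shares)
```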
# Brand by Profit and Standard Cost
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 6))
# First subplot - Profit by Brand
plt.subplot(1, 2, 1)
sns.barplot(data=MasterClean, x="brand", y='profit', color=sns.color_palette()[0])
plt.xlabel('Brand')
plt.ylabel('Profit (in USD)')
plt.xticks(rotation=90)
plt.title("Brand by Profit")
# Second subplot - Standard Cost by Brand
plt.subplot(1, 2, 2)
sns.barplot(data=MasterClean, x="brand", y='standard_cost', color=sns.color_palette()[1])
plt.xlabel('Brand')
plt.ylabel('Standard Cost')
plt.xticks(rotation=90)
plt.title("Brand by Standard Cost")
plt.tight_layout()
plt.show()
#Gender vs owns a car
plt.figure(figsize = (7,4), dpi = 100)
sns.countplot(data = MasterClean, x = "owns_car",hue='gender')
plt.xlabel('owns a car');
plt.ylabel('count')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title("Gender vs owns a car")
plt.show()
#State vs Gender
plt.figure(figsize = (8,4), dpi = 100)
sns.countplot(data = MasterClean, x = "state",hue='gender')
plt.xlabel('State');
plt.ylabel('Count')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title("State VS Gender")
plt.show()
#State vs Gender
plt.figure(figsize = (8,4), dpi = 100)
sns.countplot(data = MasterClean, x = "online_order",hue='gender')
plt.xlabel('Online Order');
plt.ylabel('Count')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title("Online Order VS Gender")
plt.show()
# Wealth Segment Percentage and Profits
# Create a figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6), dpi=100)
# Calculate the order count and profit per wealth segment
counts = MasterClean['wealth_segment'].value_counts()
profits = MasterClean.groupby('wealth_segment')['profit'].sum()
# Normalize the counts to percentages
counts_percent = (counts / counts.sum()) * 100
# Create a bar plot for the order distribution, in percentages
sns.barplot(x=counts_percent.index, y=counts_percent.values, color='lightblue', ax=ax1)
ax1.set_xlabel('Wealth Segment')
ax1.set_ylabel('Percentage')
ax1.set_title("Order Distribution by Wealth Segment")
# Add profit labels to the bars in the order distribution plot
for i, p in enumerate(ax1.patches):
    # Bars follow counts.index order, so look up each segment's profit by label
    ax1.annotate(f'${profits[counts.index[i]]:.2f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='center', fontsize=10, color='black', xytext=(0, 10),
                 textcoords='offset points')
# Create a bar plot for profit distribution
sns.barplot(x=profits.index, y=profits.values, color='lightgreen', ax=ax2)
ax2.set_xlabel('Wealth Segment')
ax2.set_ylabel('Profit')
ax2.set_title("Profit Distribution by Wealth Segment")
# Add profit labels to the bars in the profit distribution plot
for i, p in enumerate(ax2.patches):
    # Bars follow profits.index order here, so positional lookup is safe
    ax2.annotate(f'${profits.iloc[i]:.2f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='center', fontsize=10, color='black', xytext=(0, 10),
                 textcoords='offset points')
# Adjust spacing between subplots
plt.tight_layout()
plt.show()
Multivariate Analysis¶
In this section we will use a correlation matrix to check the relationships between the numeric variables. Nothing we didn't already know.
# Drop non-numeric columns or encode them properly
numeric_columns = MasterClean.select_dtypes(include=['number'])
# Calculate the correlation matrix
correlation_matrix = numeric_columns.corr()
# Create the correlation matrix heatmap
plt.figure(figsize=(8,6), dpi=100)
sns.heatmap(correlation_matrix, cmap="viridis", annot=True)
plt.title("Correlation Matrix")
plt.xticks(rotation=90)
plt.show()
Task 2 : RFM Analysis¶
RFM stands for Recency, Frequency and Monetary Value, it is a customer segmentation technique that uses past purchase behavior to divide customers into groups. This will include the following steps:
- Calculating the `recency` column, i.e. the number of days since the last purchase
- Calculating the `frequency` column, i.e. the number of purchases made by a unique customer
- Calculating the `monetary` column, i.e. the total amount of money spent by a unique customer
- Making the `RFM` dataframe by merging the three columns
Usually we would calculate an RFM score which is also fairly easy to do, but we will be using K-Means clustering to make customer segments instead. We will be using standard scaler and elbow method to find the optimal number of clusters.
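For reference, that conventional score-based approach can be sketched with `pd.qcut`: rank each metric into quartiles and concatenate the digits. This is a hedged illustration on toy data, not what we run below (we use K-Means instead):

```python
import pandas as pd

# Toy RFM table; the real one is built from Postcoded below
rfm = pd.DataFrame({
    "recency":   [9, 130, 197, 18, 66, 240, 5, 80],
    "frequency": [11, 3, 2, 6, 5, 1, 9, 4],
    "monetary":  [3018, 2226, 220, 2395, 3947, 150, 5100, 900],
})

# Low recency is good (recent buyers), so its labels run high-to-low
rfm["R"] = pd.qcut(rfm["recency"], 4, labels=[4, 3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"], 4, labels=[1, 2, 3, 4]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"], 4, labels=[1, 2, 3, 4]).astype(int)

# Combine into a single three-digit score; 444 marks the best customers
rfm["rfm_score"] = rfm["R"] * 100 + rfm["F"] * 10 + rfm["M"]
print(rfm[["R", "F", "M", "rfm_score"]])
```

The scores can then be bucketed (e.g. 444 as "champions", 111 as "lost") much like the Gold/Silver/Bronze labels the clustering produces below.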
Creating the RFM Table¶
- Since we already have the `recency`, `frequency` and `profit` columns, we can build the RFM table by aggregating them per customer
- We will later map them back to the master dataframe
# Generating the RFM Table: one row per customer
RfmTable = Postcoded.groupby("customer_id").agg(
    {
        "recency": "min",      # days since the most recent purchase
        "frequency": "count",  # number of transactions
        "profit": "sum",       # total profit from the customer
    }
)
RfmTable.rename(columns={"profit": "monetary"}, inplace=True)
RfmTable.head()
| recency | frequency | monetary | |
|---|---|---|---|
| customer_id | |||
| 1 | 9 | 11 | 3018.09 |
| 2 | 130 | 3 | 2226.26 |
| 4 | 197 | 2 | 220.57 |
| 5 | 18 | 6 | 2394.94 |
| 6 | 66 | 5 | 3946.55 |
Finding the Optimal Number of Clusters¶
- We wrote a function to scale the features (using standard scaler)
- We wrote a function to fit the KMeans algorithm on the scaled features and return the model as well as the predictions
- Make sure to run this only on the clean rfm dataframe
The elbow method shows that the optimal number of clusters is 3, so we will assign each customer to one of these clusters.
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
def plot_elbow_method(df):
scaler = StandardScaler()
scaled = scaler.fit_transform(df)
inertia = []
for i in np.arange(1, 11):
        kmeans = KMeans(n_clusters=i, n_init=10, random_state=1)  # n_init explicitly set to 10 (the default) to suppress a warning
kmeans.fit(scaled)
inertia.append(kmeans.inertia_)
plt.plot(np.arange(1, 11), inertia, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()
return scaled
scaled_array = plot_elbow_method(RfmTable)
Classifying the Customers¶
Now that we have the clusters, we will analyse them to see which has the highest profit, then map each customer ID to its cluster and label the high-value customers.
- Gold customers are comparatively few in number but bring in the highest profit, both overall and per customer
- Silver customers are the most numerous, but their per-customer spend is less than half of Gold's
- Bronze customers are the fewest and the least profitable; they also buy rarely (highest recency, lowest frequency)
# Use k-means to cluster the RFM table
kmeans = KMeans(n_clusters=3, n_init=10, random_state=10)
kmeans.fit(scaled_array)
# +1 to make the clusters start from 1 instead of 0
RfmTable['cluster'] = kmeans.labels_+1
RfmTable.head()
| recency | frequency | monetary | cluster | |
|---|---|---|---|---|
| customer_id | ||||
| 1 | 9 | 11 | 3018.09 | 2 |
| 2 | 130 | 3 | 2226.26 | 3 |
| 4 | 197 | 2 | 220.57 | 3 |
| 5 | 18 | 6 | 2394.94 | 1 |
| 6 | 66 | 5 | 3946.55 | 1 |
# Using a ClusterMap to group clusters by their average RFM values
ClusterMap = RfmTable.groupby("cluster").agg(
{
"recency": "mean",
"frequency": "mean",
"monetary": "mean",
}
)
ClusterMap["cluster_size"] = RfmTable.groupby("cluster")["cluster"].count()
# Sorting the ClusterMap by monetary_value in descending order
SortedClusterMap = ClusterMap.sort_values(by="monetary", ascending=False)
# Preparing the customer_category column
customer_category = ["Gold", "Silver", "Bronze"]
SortedClusterMap["customer_category"] = customer_category
CategoryCluster = SortedClusterMap["customer_category"].to_dict()
SortedClusterMap
| recency | frequency | monetary | cluster_size | customer_category | |
|---|---|---|---|---|---|
| cluster | |||||
| 2 | 40.156137 | 8.150722 | 5023.386372 | 1108 | Gold |
| 1 | 43.959677 | 4.846198 | 2312.542056 | 1736 | Silver |
| 3 | 166.327465 | 3.642606 | 1871.216673 | 568 | Bronze |
Visualizing the Clusters¶
- We will make an interactive cluster plot using plotly and see if we can find any insights.
- We will also use seaborn to make a cluster plot.
# Using Plotly to plot the ClusterMap
import plotly.express as px
# Create an interactive 3D scatter plot
fig = px.scatter_3d(
RfmTable,
x='recency',
y='frequency',
z='monetary',
color='cluster',
size_max=10,
opacity=0.7,
labels={'recency': 'Recency', 'frequency': 'Frequency', 'monetary': 'Monetary'},
title='RFM Clusters - Interactive 3D Scatter Plot'
)
fig.show()
# Using SeaBorn to plot the ClusterMap
import seaborn as sns
import matplotlib.pyplot as plt
# Create a joint plot for 'recency' and 'monetary'
sns.set(style="white")
fig1 = sns.jointplot(
data=RfmTable,
x='recency',
y='monetary',
hue='cluster',
alpha=0.7,
markers=["o", "s", "D"],
height=7
)
fig1.set_axis_labels('Recency', 'Monetary')
fig1.fig.suptitle('RFM Clusters - Recency vs Monetary', y=1.02)
# Create a joint plot for 'frequency' and 'monetary'
fig2 = sns.jointplot(
data=RfmTable,
x='frequency',
y='monetary',
hue='cluster',
alpha=0.7,
markers=["o", "s", "D"],
height=7
)
fig2.set_axis_labels('Frequency', 'Monetary')
fig2.fig.suptitle('RFM Clusters - Frequency vs Monetary', y=1.02)
# Display the subplots
plt.show()
Mapping the Clusters to the Master Dataframe¶
- This will add two new columns to the master dataframe: `cluster` and `customer_category`
# Mapping the clusters to the customers
Postcoded['cluster'] = Postcoded['customer_id'].map(RfmTable['cluster'])
Postcoded['customer_category'] = Postcoded['cluster'].map(CategoryCluster)
Postcoded.sample(5)
| transaction_id | product_id | customer_id | transaction_date | online_order | order_status | brand | product_line | product_class | product_size | list_price | standard_cost | product_first_sold_date | first_name | last_name | gender | past_3_years_bike_related_purchases | DOB | job_title | job_industry_category | wealth_segment | deceased_indicator | owns_car | tenure | address | postcode | country | property_valuation | lat | long | state | customer_age | product_age | age_group | profit | recency | frequency | transaction_day | transaction_week | transaction_month | cluster | customer_category | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13170 | 14831 | 15 | 2618 | 2017-03-31 | False | True | Norco Bicycles | Standard | low | medium | 958.74 | 748.90 | 1999-12-19 09:34:07.879592 | Jay | Prue | M | 85 | 1977-12-14 | Administrative Assistant IV | NaN | Mass Customer | False | False | 4 | 2271 Fordem Plaza | 2780 | Australia | 8 | -33.832923 | 150.435639 | NSW | 40 | 18.0 | 31-50 | 209.84 | 276 | 10 | Friday | 13 | March | 2 | Gold |
| 2078 | 12054 | 76 | 1790 | 2017-11-28 | True | True | WeareA2B | Standard | low | medium | 642.31 | 513.85 | 2008-10-21 09:34:07.879592 | Vite | Molyneux | M | 90 | 1993-09-25 | Budget/Accounting Analyst IV | Health | Affluent Customer | False | False | 1 | 4092 Brentwood Trail | 2147 | Australia | 10 | -33.766075 | 150.935836 | NSW | 24 | 9.0 | 14-30 | 128.46 | 34 | 5 | Tuesday | 48 | November | 1 | Silver |
| 5730 | 12667 | 89 | 1594 | 2017-12-30 | True | True | WeareA2B | Touring | medium | large | 1362.99 | 57.74 | 2010-07-21 09:34:07.879592 | Rolando | Amyes | M | 62 | 1981-11-26 | Human Resources Manager | Property | Mass Customer | False | False | 7 | 0 Thackeray Avenue | 2100 | Australia | 10 | -33.766409 | 151.257781 | NSW | 36 | 7.0 | 31-50 | 1305.25 | 2 | 6 | Saturday | 52 | December | 2 | Gold |
| 5804 | 16817 | 31 | 673 | 2017-07-02 | False | True | Giant Bicycles | Standard | medium | medium | 230.91 | 173.18 | 1989-12-30 09:34:07.879592 | Artie | Ommanney | M | 35 | 1975-06-27 | VP Sales | Health | Mass Customer | False | True | 4 | 2836 Morning Drive | 2304 | Australia | 7 | -32.880319 | 151.738620 | NSW | 42 | 28.0 | 31-50 | 57.73 | 183 | 5 | Sunday | 26 | July | 1 | Silver |
| 3780 | 100 | 0 | 119 | 2017-07-31 | False | True | Solex | Standard | medium | medium | 478.16 | 298.72 | 1992-12-28 09:34:07.879592 | Willey | Chastanet | M | 9 | 1981-12-04 | Associate Professor | NaN | High Net Worth | False | True | 9 | 5814 Bunker Hill Street | 3070 | Australia | 7 | -37.774064 | 144.999745 | VIC | 36 | 25.0 | 31-50 | 179.44 | 154 | 4 | Monday | 31 | July | 1 | Silver |
Task 2 : Geospatial Analysis¶
In this section we will focus on Folium and Geopandas to make customer segmentation maps.
- We will also try to geocode the customer addresses to get another pair of coordinate columns, `geo_lat` and `geo_long`
- Geocoding would take a lot of time, so we'll just work with the coordinates we already have

It seems most of our sales are in coastal areas; with that lovely view, guess I'll buy a bike too.

- We will make a new column `distance_from_coast`, the distance of each customer from the coastline
- NSW has the most sales, and most of those are in its coastal areas
- Followed by Victoria and then Queensland
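The coastline-distance computation later in this section relies on the haversine great-circle formula; a self-contained sketch of the underlying math (the coordinates here are approximate, for illustration only):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Roughly a Sydney customer vs. a nearby point on the NSW coast
print(round(haversine_km(-33.87, 151.21, -33.85, 151.28), 1))
```

The `haversine` package used below wraps the same formula, with a `Unit` argument for choosing kilometres or miles.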
# Preparing the GeoCustomers dataframe
# We will use this to draw interactive maps
GeoCustomers = Postcoded.copy()
# Use groupby to get unique customer_id
GeoCustomers = GeoCustomers.groupby('customer_id').first().reset_index()
GeoCustomers = GeoCustomers[['customer_id', 'lat', 'long', 'state', 'cluster', 'customer_category']]
GeoCustomers.head()
| customer_id | lat | long | state | cluster | customer_category | |
|---|---|---|---|---|---|---|
| 0 | 1 | -33.894912 | 151.206211 | NSW | 2 | Gold |
| 1 | 2 | -33.731651 | 150.955942 | NSW | 3 | Bronze |
| 2 | 4 | -28.035453 | 153.241258 | QLD | 3 | Bronze |
| 3 | 5 | -30.604667 | 152.956681 | NSW | 1 | Silver |
| 4 | 6 | -38.215906 | 144.334005 | VIC | 1 | Silver |
# Importing Folium and Geopandas
import geopandas as gpd
import folium
# Importing folium plugins
from folium.plugins import MarkerCluster, HeatMap
# Creating a simple map
m_1 = folium.Map(location=[-25.2744, 133.7751], tiles='cartodbpositron', zoom_start=4)
# Adding transaction points to the map using MarkerCluster
mc = MarkerCluster()
for idx, row in GeoCustomers.iterrows():
if not np.isnan(row['lat']) and not np.isnan(row['long']):
mc.add_child(folium.Marker([row['lat'], row['long']]))
m_1.add_child(mc)
# Display the map
m_1
# Drawing a heatmap of the transaction points
m_2 = folium.Map(location=[-25.2744, 133.7751], tiles='cartodbpositron', zoom_start=4)
# Adding transaction points to the map using HeatMap
HeatMap(data=GeoCustomers[['lat', 'long']], radius=15).add_to(m_2)
# Display the map
m_2
# Draw a bubble map to visualize the customer categories gold, silver and bronze
# Create a base map
m_3 = folium.Map(location=[-25.2744, 133.7751], tiles='cartodbpositron', zoom_start=4)
# Add a bubble map to the base map
for idx, row in GeoCustomers.iterrows():
    if not np.isnan(row['lat']) and not np.isnan(row['long']):
        if row['customer_category'] == 'Gold':
            folium.CircleMarker([row['lat'], row['long']], radius=5, color='yellow', fill=True).add_to(m_3)
        elif row['customer_category'] == 'Silver':
            folium.CircleMarker([row['lat'], row['long']], radius=5, color='blue', fill=True).add_to(m_3)
        elif row['customer_category'] == 'Bronze':
            folium.CircleMarker([row['lat'], row['long']], radius=5, color='brown', fill=True).add_to(m_3)
# Display the map
m_3
Analysing Coastline Proximity¶
Refer to this notebook for the code.
GeoCustomers
| | customer_id | lat | long | state | cluster | customer_category |
|---|---|---|---|---|---|---|
| 0 | 1 | -33.894912 | 151.206211 | NSW | 2 | Gold |
| 1 | 2 | -33.731651 | 150.955942 | NSW | 3 | Bronze |
| 2 | 4 | -28.035453 | 153.241258 | QLD | 3 | Bronze |
| 3 | 5 | -30.604667 | 152.956681 | NSW | 1 | Silver |
| 4 | 6 | -38.215906 | 144.334005 | VIC | 1 | Silver |
| ... | ... | ... | ... | ... | ... | ... |
| 3407 | 3496 | -33.937716 | 150.848737 | NSW | 3 | Bronze |
| 3408 | 3497 | -38.043995 | 145.264296 | VIC | 1 | Silver |
| 3409 | 3498 | -37.807135 | 144.861162 | VIC | 3 | Bronze |
| 3410 | 3499 | -27.549179 | 152.951385 | QLD | 2 | Gold |
| 3411 | 3500 | -33.766409 | 151.257781 | NSW | 3 | Bronze |
3412 rows × 6 columns
from shapely.ops import nearest_points
from shapely.geometry import Point
import haversine as hs
from haversine import Unit
def coastline_minima(df, lat_col, long_col, country):
    """
    Takes a dataframe with latitude and longitude columns and returns the nearest point on the coastline and the distance in km
    Required libraries: geopandas, shapely, haversine
    Returns: original dataframe with two new columns: nearest_point and distance_to_coast
    """
    # Getting the coastline
    world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
    coastline = world[world.name.str.contains(country)].boundary
    # Creating empty lists to store the nearest point and the distance
    nearest_points_list = []
    distance_list = []
    for i in range(len(df)):
        geo = Point(df[long_col][i], df[lat_col][i])
        nearest = nearest_points(geo, coastline)[1]
        cos = Point(nearest.x, nearest.y)
        loc1 = (geo.y, geo.x)
        loc2 = (cos.y, cos.x)
        result = hs.haversine(loc1, loc2, unit=Unit.KILOMETERS)
        nearest_points_list.append(nearest)
        distance_list.append(result)
    # Adding the new columns to the dataframe
    df["nearest_point"] = nearest_points_list
    df["distance_to_coast"] = distance_list
    return df
coastline_minima(GeoCustomers, 'lat', 'long', 'Australia')
GeoCustomers.head()
/tmp/ipykernel_491/1020208499.py:14: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
| | customer_id | lat | long | state | cluster | customer_category | nearest_point | distance_to_coast |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -33.894912 | 151.206211 | NSW | 2 | Gold | 137 POINT (151.26433 -33.93411) dtype: geom... | 6.910553 |
| 1 | 2 | -33.731651 | 150.955942 | NSW | 3 | Bronze | 137 POINT (151.26176 -33.93792) dtype: geom... | 36.385439 |
| 2 | 4 | -28.035453 | 153.241258 | QLD | 3 | Bronze | 137 POINT (153.45911 -27.91328) dtype: geom... | 25.342373 |
| 3 | 5 | -30.604667 | 152.956681 | NSW | 1 | Silver | 137 POINT (153.07812 -30.60035) dtype: geom... | 11.632766 |
| 4 | 6 | -38.215906 | 144.334005 | VIC | 1 | Silver | 137 POINT (144.33147 -38.21284) dtype: geom... | 0.406144 |
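As a sanity check on those distance_to_coast values, here is a minimal pure-Python version of the haversine great-circle formula, the same formula the haversine package implements. The Earth radius of 6371 km and the Sydney/Melbourne coordinates below are approximate and only for illustration:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)  # delta latitude
    dl = math.radians(lon2 - lon1)  # delta longitude
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Sydney to Melbourne is roughly 713 km as the crow flies
print(haversine_km(-33.8688, 151.2093, -37.8136, 144.9631))
```

This is why the 0.4 km value for the Geelong-area customer above is plausible: the point sits essentially on the coastline polygon.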
# Merging the GeoCustomers columns back into the Postcoded dataframe
Postcoded = pd.merge(
    Postcoded,
    GeoCustomers[['customer_id', 'nearest_point', 'distance_to_coast']],
    how="inner",
    on="customer_id",
)
Postcoded.sample(5)
| | transaction_id | product_id | customer_id | transaction_date | online_order | order_status | brand | product_line | product_class | product_size | list_price | standard_cost | product_first_sold_date | first_name | last_name | gender | past_3_years_bike_related_purchases | DOB | job_title | job_industry_category | wealth_segment | deceased_indicator | owns_car | tenure | address | postcode | country | property_valuation | lat | long | state | customer_age | product_age | age_group | profit | recency | frequency | transaction_day | transaction_week | transaction_month | cluster | customer_category | nearest_point | distance_to_coast |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15259 | 10956 | 50 | 710 | 2017-11-21 | False | True | Giant Bicycles | Standard | medium | medium | 642.70 | 211.37 | 1996-04-02 09:34:07.879592 | Benton | Oakly | M | 76 | 1970-03-01 | Sales Representative | Retail | High Net Worth | False | True | 13 | 1721 Nobel Junction | 2753 | Australia | 8 | -33.604022 | 150.646053 | NSW | 47 | 21.0 | 31-50 | 431.33 | 41 | 7 | Tuesday | 47 | November | 2 | Gold | 137 POINT (151.22403 -33.99385) dtype: geom... | 68.783845 |
| 8448 | 2666 | 28 | 298 | 2017-03-07 | False | True | Solex | Road | medium | small | 1703.52 | 1516.13 | 2005-04-27 09:34:07.879592 | Maurise | Lenglet | M | 69 | 1956-08-15 | Engineer I | Manufacturing | High Net Worth | False | False | 9 | 83 Brickson Park Drive | 3029 | Australia | 8 | -37.837165 | 144.705831 | VIC | 61 | 12.0 | 51-70 | 187.39 | 300 | 11 | Tuesday | 10 | March | 2 | Gold | 137 POINT (144.75898 -37.99074) dtype: geom... | 17.702334 |
| 17705 | 19788 | 92 | 2474 | 2017-08-09 | False | True | WeareA2B | Touring | medium | large | 1890.39 | 260.14 | 1992-12-28 09:34:07.879592 | Chrysa | Burrett | F | 59 | 1980-06-08 | Executive Secretary | Argiculture | High Net Worth | False | True | 6 | 70 Erie Street | 4814 | Australia | 2 | -19.285821 | 146.754745 | QLD | 37 | 25.0 | 31-50 | 1630.25 | 145 | 8 | Wednesday | 32 | August | 2 | Gold | 137 POINT (146.81361 -19.16373) dtype: geom... | 14.916644 |
| 1252 | 11862 | 76 | 867 | 2017-01-31 | True | True | WeareA2B | Standard | low | medium | 642.31 | 513.85 | 2008-10-21 09:34:07.879592 | Lura | MacKim | F | 12 | 1973-05-10 | VP Marketing | NaN | Mass Customer | False | False | 17 | 8360 Washington Avenue | 3124 | Australia | 12 | -37.840867 | 145.068191 | VIC | 44 | 9.0 | 31-50 | 128.46 | 335 | 6 | Tuesday | 05 | January | 3 | Bronze | 137 POINT (145.03221 -37.89619) dtype: geom... | 6.914766 |
| 7009 | 7460 | 16 | 1234 | 2017-02-12 | NaN | True | Norco Bicycles | Standard | high | small | 1661.92 | 1479.11 | 1988-09-20 09:34:07.879592 | Rancell | Yven | M | 87 | 1965-04-19 | Geologist I | NaN | Mass Customer | False | False | 17 | 3765 Mandrake Alley | 2763 | Australia | 9 | -33.725619 | 150.895470 | NSW | 52 | 29.0 | 51-70 | 182.81 | 323 | 8 | Sunday | 06 | February | 1 | Silver | 137 POINT (151.24565 -33.96180) dtype: geom... | 41.660704 |
Task 3 : Model Development¶
In the final task of this project, we will use all the data and features we built earlier to train a machine learning model that predicts which customer segment new customers will fall into.

We will build two models:
- a simple logistic regression model
- a HistGradientBoostingClassifier model

We will also use RandomizedSearchCV to find the best parameters for the models.
Data Preprocessing¶
Let's see what we are working with:
- We have Postcoded as the training data
- We have NewCustomerList as the testing data

We need to add features to the testing data as well.
Train = Postcoded.copy()
Train.columns
Index(['transaction_id', 'product_id', 'customer_id', 'transaction_date',
'online_order', 'order_status', 'brand', 'product_line',
'product_class', 'product_size', 'list_price', 'standard_cost',
'product_first_sold_date', 'first_name', 'last_name', 'gender',
'past_3_years_bike_related_purchases', 'DOB', 'job_title',
'job_industry_category', 'wealth_segment', 'deceased_indicator',
'owns_car', 'tenure', 'address', 'postcode', 'country',
'property_valuation', 'lat', 'long', 'state', 'customer_age',
'product_age', 'age_group', 'profit', 'recency', 'frequency',
'transaction_day', 'transaction_week', 'transaction_month', 'cluster',
'customer_category', 'nearest_point', 'distance_to_coast'],
dtype='object')
Test = NewCustomerList.copy()
Test.columns
Index(['first_name', 'last_name', 'gender',
'past_3_years_bike_related_purchases', 'DOB', 'job_title',
'job_industry_category', 'wealth_segment', 'deceased_indicator',
'owns_car', 'tenure', 'address', 'postcode', 'state', 'country',
'property_valuation', 'Rank', 'Value'],
dtype='object')
Feature Engineering¶
We'll be engineering the following features in the test data:
- customer_age
- age_group
- lat, long
- distance_to_coast
So essentially we will only be expanding the DOB and postcode columns. Let's also try combining property_valuation with tenure; interaction features like that sometimes help.
And we will add two more features based on the Property Valuation and Tenure to both the Training and Testing Data
- tenure_valuation_mul
- tenure_valuation_div
And also potential_customer_id to the testing data
Age and Age Group¶
- customer_age
- age_group
# Customer Age Feature
Test["DOB"] = pd.to_datetime(Test["DOB"])
Test["customer_age"] = (end_date - Test["DOB"]).dt.days // 365.25
Test["customer_age"] = Test["customer_age"].astype("Int64")
# Keeping only the values between 14 and 100
Test = Test[Test["customer_age"].between(14, 100)]
# Calculate the Age Group
# Note: the labels must match the age_group categories in the training data
# ("14-30", "31-50", ...), otherwise the encoders treat the test groups as unknown
age_group = pd.cut(
    Test["customer_age"],
    bins=[14, 30, 50, 70, 90],
    labels=["14-30", "31-50", "51-70", "71-90"]
)
Test["age_group"] = age_group
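One pd.cut subtlety worth remembering: bins are right-closed by default, so the lower edge of the first bin falls outside every interval. With bins starting at 14, a 14-year-old kept by the between(14, 100) filter ends up with a missing age group. A quick illustration with made-up ages:

```python
import pandas as pd

ages = pd.Series([14, 15, 30, 31, 70, 90])
groups = pd.cut(ages, bins=[14, 30, 50, 70, 90])
# intervals are (14, 30], (30, 50], (50, 70], (70, 90]:
# 14 itself maps to NaN, 30 falls in the first bin and 31 in the second
print(groups.tolist())
```

Passing right=False to pd.cut would flip this to left-closed intervals if the lower edge should be included instead.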
Geo Location¶
- lat
- long
- distance_to_coast
- state
- nearest_point
# Adding the Coordinate Columns
Test = pd.merge(
    Test,
    Postcodes[['postcode', 'lat', 'long', 'state']],
    how="inner",
    on="postcode",
)
print(Test.shape)
print(Postcodes.shape)
# Dropping the state_x column
Test.drop(columns="state_x", inplace=True)
Test.rename(columns={"state_y": "state"}, inplace=True)
Test.sample(5)
(983, 23)
(3167, 4)
| | first_name | last_name | gender | past_3_years_bike_related_purchases | DOB | job_title | job_industry_category | wealth_segment | deceased_indicator | owns_car | tenure | address | postcode | country | property_valuation | Rank | Value | customer_age | age_group | lat | long | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 444 | Jared | Fendlow | M | 79 | 1963-08-25 | Media Manager II | Financial Services | Mass Customer | False | False | 15 | 6195 Bellgrove Lane | 4211 | Australia | 7 | 832 | 0.5750 | 54 | 50-70 | -28.035453 | 153.241258 | QLD |
| 753 | Georgi | NaN | M | 29 | 1970-01-14 | Assistant Manager | Manufacturing | High Net Worth | False | False | 11 | 59 Garrison Terrace | 3215 | Australia | 4 | 485 | 0.8755 | 47 | 30-50 | -38.111002 | 144.334520 | VIC |
| 712 | Frederigo | Cribbott | M | 35 | 1965-03-15 | Social Worker | Health | Mass Customer | False | False | 16 | 42280 Namekagon Crossing | 2140 | Australia | 8 | 754 | 0.6460 | 52 | 50-70 | -33.835018 | 151.070853 | NSW |
| 279 | Rebeca | Aggas | F | 66 | 1953-02-27 | Social Worker | Health | Affluent Customer | False | False | 21 | 7026 Katie Lane | 3818 | Australia | 1 | 127 | 1.2500 | 64 | 50-70 | -38.096289 | 145.867337 | VIC |
| 250 | Kearney | Cuddehy | M | 88 | 1997-05-18 | Marketing Assistant | Retail | Mass Customer | False | False | 12 | 98 Shoshone Road | 4207 | Australia | 6 | 634 | 0.7395 | 20 | 14-30 | -27.733991 | 153.204523 | QLD |
# Calculating the distance to coast
Test = coastline_minima(Test, 'lat', 'long', 'Australia')
/tmp/ipykernel_491/1020208499.py:14: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
Tenure and Property Valuation¶
- tenure_valuation_mul
- tenure_valuation_div
# Multiplying and Dividing the tenure and property_valuation
Train["tenure_valuation_mul"] = Train["tenure"] * Train["property_valuation"]
Train["tenure_valuation_div"] = Train["tenure"] / Train["property_valuation"]
# Same for Test
Test["tenure_valuation_mul"] = Test["tenure"] * Test["property_valuation"]
Test["tenure_valuation_div"] = Test["tenure"] / Test["property_valuation"]
Potential Customer ID¶
- potential_customer_id
Test['potential_customer_id'] = Test.index + 1
Building a Basic Logistic Regression Model¶
- We will copy the training data to data
- Look out for leakage while building the model: drop all the columns that are not present in the testing data
- Also drop the columns that can't be classified as numerical or categorical

Feature Selection¶
- Defining cols with all the features that are supposed to go into data
# Getting all the valid columns for building the model
cols = [
"gender",
"past_3_years_bike_related_purchases",
"job_title",
"job_industry_category",
"wealth_segment",
"deceased_indicator",
"owns_car",
"tenure",
"property_valuation",
"lat",
"long",
"state",
"customer_age",
"age_group",
"distance_to_coast",
"tenure_valuation_mul",
"tenure_valuation_div",
"customer_category",
]
data = Train[cols]
# Defining the target variable
target_name = "customer_category"
target = data[target_name]
data = data.drop(columns=target_name)
Defining Categorical and Numerical Columns¶
- We will convert all the datatypes appropriately so that there's only either objects or numbers in the dataframe
- We will use sklearn's make_column_selector to select the columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19513 entries, 0 to 19512
Data columns (total 17 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   gender                               19513 non-null  object
 1   past_3_years_bike_related_purchases  19513 non-null  Int64
 2   job_title                            17173 non-null  object
 3   job_industry_category                16291 non-null  object
 4   wealth_segment                       19513 non-null  object
 5   deceased_indicator                   19513 non-null  object
 6   owns_car                             19513 non-null  object
 7   tenure                               19513 non-null  Int64
 8   property_valuation                   19513 non-null  Int64
 9   lat                                  19513 non-null  float64
 10  long                                 19513 non-null  float64
 11  state                                19513 non-null  object
 12  customer_age                         19513 non-null  Int64
 13  age_group                            19513 non-null  category
 14  distance_to_coast                    19513 non-null  float64
 15  tenure_valuation_mul                 19513 non-null  Int64
 16  tenure_valuation_div                 19513 non-null  Float64
dtypes: Float64(1), Int64(5), category(1), float64(3), object(7)
memory usage: 2.5+ MB
# Convert age_group to object
data["age_group"] = data["age_group"].astype("object")
# Convert bools to object
bool_cols = data.columns[data.dtypes.eq("bool")]
data[bool_cols] = data[bool_cols].astype("object")
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19513 entries, 0 to 19512
Data columns (total 17 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   gender                               19513 non-null  object
 1   past_3_years_bike_related_purchases  19513 non-null  Int64
 2   job_title                            17173 non-null  object
 3   job_industry_category                16291 non-null  object
 4   wealth_segment                       19513 non-null  object
 5   deceased_indicator                   19513 non-null  object
 6   owns_car                             19513 non-null  object
 7   tenure                               19513 non-null  Int64
 8   property_valuation                   19513 non-null  Int64
 9   lat                                  19513 non-null  float64
 10  long                                 19513 non-null  float64
 11  state                                19513 non-null  object
 12  customer_age                         19513 non-null  Int64
 13  age_group                            19513 non-null  object
 14  distance_to_coast                    19513 non-null  float64
 15  tenure_valuation_mul                 19513 non-null  Int64
 16  tenure_valuation_div                 19513 non-null  Float64
dtypes: Float64(1), Int64(5), float64(3), object(8)
memory usage: 2.6+ MB
from sklearn.compose import make_column_selector as selector
numerical_columns_selector = selector(dtype_exclude=object)
category_columns_selector = selector(dtype_include=object)
numerical_columns = numerical_columns_selector(data)
category_columns = category_columns_selector(data)
# Print the lengths of numerical_columns, category_columns, and data.columns
print(f"Number of numerical columns: {len(numerical_columns)}")
print(f"Number of categorical columns: {len(category_columns)}")
print(f"Total number of columns: {len(data.columns)}")
Number of numerical columns: 9
Number of categorical columns: 8
Total number of columns: 17
# One-hot encoding the categorical columns and standardizing the numerical columns
from sklearn.preprocessing import OneHotEncoder, StandardScaler
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()
# Creating the Transformer and preprocessor object
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
[
("cat-preprocessor", categorical_preprocessor, category_columns),
("num-preprocessor", numerical_preprocessor, numerical_columns),
]
)
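A quick look at what handle_unknown="ignore" buys us: categories unseen during fit are encoded as an all-zero row instead of raising an error, which matters because the new-customer data may contain category values the training data never saw. A small sketch using state labels like those in our data (the unseen "TAS" is made up for illustration):

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["NSW"], ["VIC"], ["QLD"]])

print(enc.transform([["VIC"]]).toarray())  # one-hot row for a known category
print(enc.transform([["TAS"]]).toarray())  # unseen category encodes as all zeros
```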
# Building a model with a pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
model = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
model
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('cat-preprocessor',
OneHotEncoder(handle_unknown='ignore'),
['gender', 'job_title',
'job_industry_category',
'wealth_segment',
'deceased_indicator',
'owns_car', 'state',
'age_group']),
('num-preprocessor',
StandardScaler(),
['past_3_years_bike_related_purchases',
'tenure',
'property_valuation', 'lat',
'long', 'customer_age',
'distance_to_coast',
'tenure_valuation_mul',
'tenure_valuation_div'])])),
('logisticregression', LogisticRegression(max_iter=1000))])
# Making the train-test split
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(
data, target, test_size=0.2, random_state=42
)
_ = model.fit(data_train, target_train)
data_test.head()
| | gender | past_3_years_bike_related_purchases | job_title | job_industry_category | wealth_segment | deceased_indicator | owns_car | tenure | property_valuation | lat | long | state | customer_age | age_group | distance_to_coast | tenure_valuation_mul | tenure_valuation_div |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9530 | F | 60 | Administrative Officer | Financial Services | High Net Worth | False | False | 12 | 7 | -37.869072 | 144.999026 | VIC | 38 | 31-50 | 3.973526 | 84 | 1.714286 |
| 385 | M | 68 | Financial Analyst | Financial Services | Mass Customer | False | False | 3 | 10 | -33.731523 | 150.931074 | NSW | 36 | 31-50 | 38.424585 | 30 | 0.3 |
| 15307 | M | 56 | Editor | Financial Services | Mass Customer | False | False | 5 | 10 | -37.744406 | 145.082685 | VIC | 41 | 31-50 | 17.449963 | 50 | 0.5 |
| 4153 | F | 78 | Quality Engineer | Financial Services | Affluent Customer | False | True | 17 | 11 | -33.732380 | 151.000948 | NSW | 43 | 31-50 | 32.667056 | 187 | 1.545455 |
| 8658 | M | 23 | Librarian | Entertainment | Mass Customer | False | False | 10 | 8 | -33.783751 | 150.900547 | NSW | 48 | 31-50 | 38.026580 | 80 | 1.25 |
Scores and Predictions¶
- Model Score: 0.5318985395849347
model.predict(data_test)[:5]
array(['Gold', 'Gold', 'Silver', 'Gold', 'Gold'], dtype=object)
target_test[:5]
9530    Bronze
385       Gold
15307     Gold
4153      Gold
8658      Gold
Name: customer_category, dtype: object
model.score(data_test, target_test)
0.5318985395849347
# Evaluating the model with cross-validation
from sklearn.model_selection import cross_validate
cv_results = cross_validate(model, data, target, cv=5)
cv_results
{'fit_time': array([1.96170378, 2.41558146, 1.45675778, 1.40848422, 1.35900307]),
'score_time': array([0.01325226, 0.02733541, 0.01228523, 0.01573133, 0.01378798]),
'test_score': array([0.46682039, 0.45836536, 0.46169613, 0.47514095, 0.46258329])}
# Getting the final score for our model
scores = cv_results["test_score"]
# Print mean cross-validation accuracy
print(f"Mean Accuracy: {scores.mean():.3f}")
print(f"Accuracy standard deviation: {scores.std():.3f}")
Mean Accuracy: 0.465
Accuracy standard deviation: 0.006
Building a Better Model¶
- HistGradientBoostingClassifier is a better model than our naive logistic regression model
- We will use RandomizedSearchCV to find the best parameters for the model
Scaling numerical features is generally unnecessary for decision-tree-based models, and for HistGradientBoostingClassifier in particular. We get better results with the OrdinalEncoder than with the OneHotEncoder, and scaling the numerical features does not make a difference.
Visit here for more information.
| | Meaningful order | Non-meaningful order |
|---|---|---|
| Tree-based model | OrdinalEncoder | OrdinalEncoder |
| Linear model | OrdinalEncoder with caution | OneHotEncoder |
- OneHotEncoder : always does something meaningful, but can be unnecessarily slow with trees.
- OrdinalEncoder : can be detrimental for linear models unless your category has a meaningful order and you make sure that OrdinalEncoder respects this order. Trees can deal with OrdinalEncoder fine as long as they are deep enough.
Hence we will use the OrdinalEncoder for the categorical features, since it gives us the best score for our use case.
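The unknown-category story for OrdinalEncoder is handled differently: with handle_unknown="use_encoded_value", categories unseen during fit map to the sentinel unknown_value instead of raising. A small sketch with the segment labels from this project (the unseen "Platinum" is made up for illustration):

```python
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit([["Bronze"], ["Gold"], ["Silver"]])

# categories are sorted alphabetically: Bronze=0, Gold=1, Silver=2
print(enc.transform([["Gold"], ["Platinum"]]))  # unseen "Platinum" encodes as -1
```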
Building the Pipeline¶
- We will build a pipeline with Ordinal Encoder and HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(
data, target, random_state=42
)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_selector as selector
categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
categorical_preprocessor = OrdinalEncoder(
handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = ColumnTransformer(
[("cat_preprocessor", categorical_preprocessor, categorical_columns)],
remainder="passthrough",
)
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline
model = Pipeline(
[
("preprocessor", preprocessor),
(
"classifier",
HistGradientBoostingClassifier(random_state=42, max_leaf_nodes=4),
),
]
)
model
Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('cat_preprocessor',
OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-1),
['gender', 'job_title',
'job_industry_category',
'wealth_segment',
'deceased_indicator',
'owns_car', 'state',
'age_group'])])),
('classifier',
HistGradientBoostingClassifier(max_leaf_nodes=4,
random_state=42))])
Applying RandomizedSearchCV on the Pipeline¶
- Code referenced from the INRIA scikit-learn MOOC

Best parameters from one earlier search run (the search is not seeded, so results vary between runs):
{'classifier__l2_regularization': 0.022391943681217833,
 'classifier__learning_rate': 0.006270138607046264,
 'classifier__max_bins': 4,
 'classifier__max_leaf_nodes': 142,
 'classifier__min_samples_leaf': 31}
from scipy.stats import loguniform
class loguniform_int:
    """Integer valued version of the log-uniform distribution"""

    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        """Random variable sample"""
        return self._distribution.rvs(*args, **kwargs).astype(int)
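To see what the wrapper produces, we can draw from the underlying continuous log-uniform directly and truncate to int, which mirrors what loguniform_int.rvs does; the samples spread across orders of magnitude instead of clustering near the upper end the way a plain uniform would:

```python
from scipy.stats import loguniform

# draw log-uniform samples in [2, 256) and truncate them to ints,
# exactly what loguniform_int(2, 256).rvs does under the hood
samples = loguniform(2, 256).rvs(size=10, random_state=0).astype(int)
print(samples)
```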
%%time
from sklearn.model_selection import RandomizedSearchCV
param_distributions = {
"classifier__l2_regularization": loguniform(1e-6, 1e3),
"classifier__learning_rate": loguniform(0.001, 10),
"classifier__max_leaf_nodes": loguniform_int(2, 256),
"classifier__min_samples_leaf": loguniform_int(1, 100),
"classifier__max_bins": loguniform_int(2, 255),
}
model_random_search = RandomizedSearchCV(
model,
param_distributions=param_distributions,
n_iter=5,
cv=5,
verbose=1,
n_jobs=-1,
)
model_random_search.fit(data_train, target_train)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
CPU times: user 25.6 s, sys: 580 ms, total: 26.1 s
Wall time: 17.6 s
RandomizedSearchCV(cv=5,
estimator=Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('cat_preprocessor',
OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-1),
['gender',
'job_title',
'job_industry_category',
'wealth_segment',
'deceased_indicator',
'owns_car',
'state',
'age_group'])])),
('classifier',
HistGr...
param_distributions={'classifier__l2_regularization': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7f920c9ea2f0>,
'classifier__learning_rate': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7f91a5a77880>,
'classifier__max_bins': <__main__.loguniform_int object at 0x7f91a5a9c340>,
'classifier__max_leaf_nodes': <__main__.loguniform_int object at 0x7f91a5a76aa0>,
'classifier__min_samples_leaf': <__main__.loguniform_int object at 0x7f91a5a77580>},
verbose=1)
accuracy = model_random_search.score(data_test, target_test)
print(f"The test accuracy score of the best model is {accuracy:.2f}")
The test accuracy score of the best model is 0.99

(Note that the split here is at the transaction level, so the same customer can appear in both the train and test sets; a customer-level split would give a more conservative estimate.)
from pprint import pprint
print("The best parameters are:")
pprint(model_random_search.best_params_)
The best parameters are:
{'classifier__l2_regularization': 0.00019812287323236764,
'classifier__learning_rate': 0.1990528942243387,
'classifier__max_bins': 12,
'classifier__max_leaf_nodes': 165,
'classifier__min_samples_leaf': 93}
# get the parameter names
column_results = [f"param_{name}" for name in param_distributions.keys()]
column_results += ["mean_test_score", "std_test_score", "rank_test_score"]
cv_results = pd.DataFrame(model_random_search.cv_results_)
cv_results = cv_results[column_results].sort_values(
"mean_test_score", ascending=False
)
def shorten_param(param_name):
    if "__" in param_name:
        return param_name.rsplit("__", 1)[1]
    return param_name
cv_results = cv_results.rename(shorten_param, axis=1)
cv_results
| l2_regularization | learning_rate | max_leaf_nodes | min_samples_leaf | max_bins | mean_test_score | std_test_score | rank_test_score | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.000198 | 0.199053 | 165 | 93 | 12 | 0.972324 | 0.004238 | 1 |
| 3 | 500.934714 | 0.474991 | 155 | 2 | 23 | 0.901873 | 0.004891 | 2 |
| 2 | 0.00956 | 1.13531 | 22 | 1 | 35 | 0.747030 | 0.039649 | 3 |
| 4 | 7.588438 | 0.001835 | 8 | 7 | 30 | 0.505877 | 0.008183 | 4 |
| 1 | 0.000154 | 1.238161 | 3 | 94 | 2 | 0.442938 | 0.026446 | 5 |
import seaborn as sns
import numpy as np
df = pd.DataFrame(
{
"max_leaf_nodes": cv_results["max_leaf_nodes"],
"learning_rate": cv_results["learning_rate"],
"score_bin": pd.cut(
cv_results["mean_test_score"], bins=np.linspace(0.5, 1.0, 6)
),
}
)
sns.set_palette("YlGnBu_r")
ax = sns.scatterplot(
data=df,
x="max_leaf_nodes",
y="learning_rate",
hue="score_bin",
s=50,
color="k",
edgecolor=None,
)
ax.set_xscale("log")
ax.set_yscale("log")
_ = ax.legend(
title="mean_test_score", loc="center left", bbox_to_anchor=(1, 0.5)
)
import numpy as np
import plotly.express as px
log10_func = np.vectorize(np.log10)
log2_func = np.vectorize(np.log2)
fig = px.parallel_coordinates(
cv_results.rename(shorten_param, axis=1).apply({
"learning_rate": log10_func,
"max_leaf_nodes": log2_func,
"min_samples_leaf": log2_func,
"max_bins": log10_func,
"l2_regularization": log10_func,
"mean_test_score": lambda x: x,
}),
color="mean_test_score",
color_continuous_scale=px.colors.sequential.Viridis,
)
fig.show()
# Deriving Predictions
predictions = Test.copy()
predictions["customer_category"] = model_random_search.predict(Test)
predictions.to_csv("Predictions.csv", index=False)
End Notes¶
This was a fun project and great hands-on experience with data analysis and machine learning. I'll soon be writing a blog post on this and will link it here when it's done. I tend to start too many things and end up completing only a few, and a lot of things get dragged out for months, but I'm glad I completed this one. This was also initially supposed to be a dashboarding task, so I'll get a dashboard ready as well; it might be great to embed in the blog post and on my website.
Take a look at my website
Also, a huge thanks to INRIA for their MOOC on scikit-learn; it was a great help. I had been looking for a solution to a similar problem on Kaggle without any luck, so it feels great to have come up with my own.
Thanks for reading, I hope you enjoyed it. Feel free to leave any feedback; I'm always looking to improve.